Data Science Application for Election Forecasting
Presented By: Dr. Ratnesh Prasad Srivastava, CSIT, GGV, C.G.
Create realistic exit poll datasets for analysis and model training using statistical sampling methods.
For each stratum, sample size is calculated as:
\[ n_h = N_h \times \frac{n}{N} \]
Where \( n_h \) is the sample size allocated to stratum \( h \), \( N_h \) is the population size of stratum \( h \), \( n \) is the total sample size, and \( N \) is the total population size. The combined estimate of the overall proportion is then:
\[ \hat{p} = \frac{1}{n} \sum_{h=1}^{H} \sum_{i=1}^{n_h} y_{hi} \]
Where \( y_{hi} \) is the response of the i-th unit in the h-th stratum.
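As a minimal sketch of the proportional allocation above (the stratum sizes are hypothetical), the per-stratum sample sizes can be computed directly:
# Proportional allocation of a total sample across strata (illustrative numbers)
import numpy as np
N_h = np.array([180_000, 240_000, 120_000, 60_000])  # hypothetical stratum population sizes
N = N_h.sum()  # total population size
n = 1200  # total planned sample size
n_h = np.round(N_h * n / N).astype(int)  # n_h = N_h * (n / N)
for h, size in enumerate(n_h, start=1):
    print(f"Stratum {h}: population {N_h[h - 1]}, allocated sample {size}")
print(f"Total allocated sample: {n_h.sum()}")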
The generated dataset records State, Age, Income, Education, and Vote for each simulated respondent.
The sampling distribution of the proportion follows a normal distribution:
\[ \hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right) \]
Where \( p \) is the true population proportion and \( n \) is the sample size.
Explore how different demographic factors influence voting behavior using statistical methods.
Predicted Seats: NDA: 295 | UPA: 145 | Others: 103
Seat Prediction Model: \[ \text{Seats} = \beta_0 + \beta_1 \times \text{Vote\%} + \beta_2 \times \text{Margin} + \beta_3 \times \text{Alliance} \]
Multiple regression model for voting behavior:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \]
Where \( y \) is the outcome being modeled (for example, the probability of voting for a given party), \( x_1 \), \( x_2 \), and \( x_3 \) are the income, education, and age predictors, the \( \beta \)'s are the coefficients to be estimated, and \( \epsilon \) is the error term.
| Coefficient | Estimate | Std. Error | t-value | p-value |
|---|---|---|---|---|
| β₀ (Intercept) | 0.24 | 0.03 | 8.00 | < 0.001 |
| β₁ (Income) | 0.32 | 0.05 | 6.40 | < 0.001 |
| β₂ (Education) | 0.18 | 0.04 | 4.50 | < 0.001 |
| β₃ (Age) | -0.15 | 0.06 | -2.50 | 0.012 |
Model fit: R² = 0.67, Adjusted R² = 0.65, F-statistic = 48.3 (p < 0.001)
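A coefficient table of this kind can be produced with statsmodels; the sketch below uses synthetic data chosen only to mirror the signs of the estimates above, so the fitted numbers will differ from the table.
# Fitting a multiple regression and printing a coefficient table (synthetic data)
import numpy as np
import statsmodels.api as sm
rng = np.random.default_rng(42)
n = 200
income = rng.normal(0, 1, n)
education = rng.normal(0, 1, n)
age = rng.normal(0, 1, n)
# Synthetic outcome loosely following the signs of the coefficients above
y = 0.24 + 0.32 * income + 0.18 * education - 0.15 * age + rng.normal(0, 0.2, n)
X = sm.add_constant(np.column_stack([income, education, age]))
model = sm.OLS(y, X).fit()
print(model.summary())  # estimates, std. errors, t-values, p-values, R-squared, F-statistic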
Comprehensive overview of statistical and machine learning approaches for exit poll prediction.
Comprehensive step-by-step methodology for conducting exit poll analysis using data science approaches.
This critical initial phase sets the foundation for the entire exit poll operation:
Sample Size Calculation:
\[ n = \frac{z^2 \times p(1-p)}{e^2} \]
Where \( z \) is the z-score for the desired confidence level, \( p \) is the anticipated proportion (0.5 is the conservative choice), and \( e \) is the desired margin of error.
For a 95% confidence level and 3% margin of error: \[ n = \frac{1.96^2 \times 0.5(1-0.5)}{0.03^2} = 1067 \]
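This calculation is simple to script, as in the short sketch below:
# Required sample size for a proportion at a given confidence level and margin of error
import math
def required_sample_size(z=1.96, p=0.5, e=0.03):
    """n = z^2 * p(1-p) / e^2, rounded up to the next whole respondent."""
    return math.ceil(z**2 * p * (1 - p) / e**2)
print(required_sample_size())  # 1068 (1067.1 rounded up)
print(required_sample_size(e=0.02))  # a tighter margin of error needs a larger sample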
Rigorous data collection protocols ensure data quality and reliability:
| Data Quality Check | Methodology | Acceptance Criteria |
|---|---|---|
| Response Rate Monitoring | Track completed vs attempted interviews | > 70% response rate |
| Data Validation | Range checks, consistency validation | < 5% data errors |
| Timeliness | Time from collection to processing | < 2 hours during polling |
| Completeness | Percentage of completed questionnaires | > 95% complete records |
Comprehensive EDA reveals patterns and informs modeling strategies:
Demographic Analysis:
\[ \text{Vote Share by Group} = \frac{\sum \text{Votes for Party in Group}}{\sum \text{Total Voters in Group}} \times 100\% \]
Cross-tabulation Analysis:
\[ \chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]
Where \( O_{ij} \) is the observed frequency and \( E_{ij} \) is the expected frequency for cell (i,j)
Advanced statistical models transform raw data into accurate predictions:
Multilevel Regression with Post-stratification (MRP):
\[ \text{Pr}(y_i = 1) = \text{logit}^{-1}(\alpha^{state[j]} + \beta^{age[j]} + \gamma^{education[j]} + \delta^{income[j]}) \]
Where parameters vary by demographic group and are estimated using hierarchical modeling.
Seat Prediction Model:
\[ \text{Seats}_p = \sum_{c=1}^{C} \text{Pr}(\text{win}_c) \]
Where the probability of winning each constituency is modeled based on historical patterns and current vote share estimates.
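A minimal sketch of this aggregation, with made-up constituency win probabilities, is shown below:
# Expected seats for a party as the sum of constituency-level win probabilities
import numpy as np
rng = np.random.default_rng(0)
win_prob = rng.uniform(0.2, 0.8, size=543)  # hypothetical Pr(win) for 543 constituencies
expected_seats = win_prob.sum()
print(f"Expected seats: {expected_seats:.1f}")
# Simulating outcomes from the same probabilities also gives an uncertainty band
simulations = rng.binomial(1, win_prob, size=(10_000, win_prob.size)).sum(axis=1)
low, high = np.percentile(simulations, [2.5, 97.5])
print(f"95% interval for seats: {low:.0f} to {high:.0f}")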
Effective communication of results with proper uncertainty quantification:
Uncertainty Estimation:
\[ \text{Prediction Interval} = \hat{y} \pm t_{\alpha/2, n-2} \times s \times \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum(x_i - \bar{x})^2}} \]
Model Performance Metrics:
\[ \text{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{A_i - F_i}{A_i} \right| \]
Where MAPE is Mean Absolute Percentage Error, \( A_i \) is actual value, and \( F_i \) is forecasted value.
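A short sketch of this metric, with hypothetical actual and forecast vote shares:
# Mean Absolute Percentage Error between actual and forecast vote shares
import numpy as np
def mape(actual, forecast):
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))
actual_share = [42.0, 37.5, 12.0, 8.5]  # hypothetical actual vote shares (%)
forecast_share = [44.0, 36.0, 11.0, 9.0]  # hypothetical exit poll forecasts (%)
print(f"MAPE: {mape(actual_share, forecast_share):.2f}%")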
Our comprehensive QA framework ensures reliable and accurate results:
| QA Component | Methods | Frequency |
|---|---|---|
| Field Supervision | Random spot checks, supervisor validation | Ongoing during data collection |
| Data Validation | Automated checks, outlier detection | Real-time during data entry |
| Model Validation | Cross-validation, back-testing | Before finalizing predictions |
| Result Verification | Comparison with actual results, error analysis | Post-election |
We adhere to strict ethical guidelines throughout our analytical process:
We employ cutting-edge data science methods for enhanced accuracy:
\[ y_i \sim \text{Bernoulli}(p_i) \]
\[ \text{logit}(p_i) = \alpha + \beta_{state[i]} + \gamma_{demographic[i]} \]
Allows for partial pooling and better uncertainty quantification
\[ \hat{y} = \sum_{m=1}^{M} w_m \hat{y}_m \]
Combines multiple models to improve prediction accuracy and robustness
\[ y_t = \beta_0 + \beta_1 t + \beta_2 y_{t-1} + \epsilon_t \]
Models trends and patterns across multiple election cycles
Addressing real-world challenges in exit poll analytics:
| Challenge | Impact | Our Solution |
|---|---|---|
| Non-response Bias | Systematic differences between respondents and non-respondents | Statistical weighting, propensity score adjustment |
| Small Sample Sizes in Subgroups | High variance for demographic subgroup estimates | Hierarchical modeling, partial pooling |
| Last-minute Voting Decisions | Response inaccuracy for undecided voters | Probabilistic modeling, uncertainty quantification |
| Geographical Heterogeneity | Different voting patterns across regions | Multilevel modeling, regional stratification |
This iterative process ensures continuous enhancement of our analytical approaches
Our approach uses stratified multistage sampling to ensure representative coverage across India's diverse electorate.
Exit polls in India present unique challenges due to the country's size, diversity, and complex electoral process. Our methodology is designed to capture accurate voting patterns while maintaining statistical rigor.
We employ a stratified multistage random sampling approach specifically designed for Indian elections:
We stratify constituencies based on:
From each stratum, we randomly select constituencies proportionally to the number of seats in that stratum.
Within each selected constituency, we randomly select polling stations considering:
Typically, we select 4-6 polling stations per constituency.
At each polling station, our field investigators use systematic random sampling:
This approach minimizes selection bias and ensures a representative sample.
For national exit polls in India, we typically aim for a sample size of 100,000-150,000 respondents:
| Election Type | Target Sample Size | Number of States Covered | Polling Stations Covered | Margin of Error |
|---|---|---|---|---|
| Lok Sabha (National) | 100,000-150,000 | 25-30 | 3,500-4,500 | ±3% at national level |
| State Assembly | 15,000-25,000 | 1 (the state) | 500-800 | ±3-5% at state level |
| By-election | 2,000-5,000 | 1 constituency | 50-80 | ±5-7% at constituency level |
Our field operations follow a strict protocol:
Our exit poll questionnaire is carefully designed to:
To ensure data quality, we implement several measures:
Conducting exit polls in India presents unique challenges:
After data collection, we apply statistical weights to correct for:
We use demographic data from the Election Commission and census to create post-stratification weights.
The weight for each respondent is calculated as:
\[ w_i = \frac{\text{Proportion in population}}{\text{Proportion in sample}} \]
Where the proportions are based on demographic characteristics like age, gender, caste, and region.
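A minimal sketch of this weighting step, using hypothetical age-group proportions, looks like this:
# Post-stratification weights: population proportion divided by sample proportion
import pandas as pd
population_share = pd.Series({'18-25': 0.22, '26-40': 0.34, '41-60': 0.29, '60+': 0.15})
sample_share = pd.Series({'18-25': 0.15, '26-40': 0.38, '41-60': 0.32, '60+': 0.15})
weights = population_share / sample_share
print(weights.round(3))  # groups under-represented in the sample receive weights above 1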
We adhere to strict ethical guidelines in our exit polling:
| Sampling Method | Description | Advantages | Disadvantages | Use Case in Exit Polls |
|---|---|---|---|---|
| Simple Random Sampling | Every member of the population has an equal chance of being selected | Unbiased, easy to implement | May not represent subgroups well, inefficient for large populations | Rarely used alone due to India's diversity |
| Stratified Sampling | Population divided into homogeneous subgroups (strata), then random sampling within each | Ensures representation of all subgroups, improves precision | Requires accurate stratification variables | Primary method for ensuring regional and demographic representation |
| Cluster Sampling | Population divided into clusters, random selection of clusters, then sample all or some units within clusters | Cost-effective, practical for large geographical areas | Higher sampling error than simple random sampling | Used for selecting polling stations within constituencies |
| Systematic Sampling | Selecting every kth element from a list after a random start | Easy to implement, evenly spread across population | Vulnerable to periodicity in the list | Used within selected clusters for voter selection |
| Multistage Sampling | Combination of multiple sampling methods | Flexible, cost-effective, practical for large populations | Complex design, potential for accumulated errors | Our primary approach: states → constituencies → polling stations → voters |
The sample size for each stratum is determined using the formula:
\[ n = \frac{N \cdot z^2 \cdot p(1-p)}{e^2(N-1) + z^2 \cdot p(1-p)} \]
Where \( N \) is the population size of the stratum, \( z \) is the z-score for the desired confidence level, \( p \) is the expected proportion (0.5 if unknown), and \( e \) is the desired margin of error.
The margin of error for a proportion is calculated as:
\[ MOE = z \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]
Where \( \hat{p} \) is the sample proportion.
When sampling without replacement from a finite population, we apply the finite population correction:
\[ MOE_{fpc} = MOE \cdot \sqrt{\frac{N - n}{N - 1}} \]
This reduces the margin of error when the sample size is large relative to the population.
For a sample proportion of 45% with a margin of error of ±3%, the 95% confidence interval runs from 42% to 48%.
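A short sketch combining the margin of error and the finite population correction (the population size here is hypothetical):
# Margin of error with and without the finite population correction
import math
p_hat = 0.45  # sample proportion
n = 1067  # sample size
N = 150_000  # hypothetical number of voters at the sampled polling stations
z = 1.96  # 95% confidence level
moe = z * math.sqrt(p_hat * (1 - p_hat) / n)
moe_fpc = moe * math.sqrt((N - n) / (N - 1))
print(f"MOE without correction: ±{moe * 100:.2f} percentage points")
print(f"MOE with correction:    ±{moe_fpc * 100:.2f} percentage points")
print(f"95% CI: {100 * (p_hat - moe_fpc):.1f}% to {100 * (p_hat + moe_fpc):.1f}%")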
We stratify our sampling based on:
We employ advanced statistical methods to make inferences about population parameters from sample data.
For proportion estimates, we calculate confidence intervals using:
\[ CI = \hat{p} \pm z \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]
Where \( \hat{p} \) is the sample proportion, \( z \) is the z-score for the desired confidence level, and \( n \) is the sample size.
The margin of error (MOE) represents the radius of the confidence interval:
\[ MOE = z \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]
For a 95% confidence level (z = 1.96), sample proportion of 0.5, and sample size of 1000:
\[ MOE = 1.96 \cdot \sqrt{\frac{0.5 \cdot 0.5}{1000}} = 0.031 \text{ or } ±3.1\% \]
This means we can be 95% confident that the true population proportion lies within ±3.1% of our sample proportion.
The margin of error depends on three main factors: the confidence level (through \( z \)), the sample proportion \( \hat{p} \), and above all the sample size \( n \):
\[ MOE \propto \frac{1}{\sqrt{n}} \]
To halve the margin of error, we need to quadruple the sample size:
\[ MOE_{\text{new}} = \frac{MOE_{\text{original}}}{2} \Rightarrow n_{\text{new}} = 4 \cdot n_{\text{original}} \]
We use Bayesian methods to update our predictions as new data arrives:
\[ P(H|D) = \frac{P(D|H) \cdot P(H)}{P(D)} \]
Where \( P(H|D) \) is the posterior probability of the hypothesis \( H \) given the data \( D \), \( P(D|H) \) is the likelihood of the data under the hypothesis, \( P(H) \) is the prior probability, and \( P(D) \) is the overall probability of the data.
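As an illustration of this updating rule (not the full production model), a Beta prior on a party's support can be updated in closed form as new interviews arrive; the prior and counts below are hypothetical:
# Bayesian updating of a vote-share estimate with a Beta prior and Binomial data
from scipy import stats
prior_alpha, prior_beta = 45, 55  # prior roughly centred on 45% support
new_supporters, new_respondents = 260, 500  # hypothetical new batch of interviews
post_alpha = prior_alpha + new_supporters
post_beta = prior_beta + (new_respondents - new_supporters)
posterior = stats.beta(post_alpha, post_beta)
print(f"Posterior mean support: {posterior.mean():.3f}")
print(f"95% credible interval: {posterior.ppf(0.025):.3f} to {posterior.ppf(0.975):.3f}")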
We test various hypotheses about voting patterns:
\[ Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1} + \frac{1}{n_2})}} \]
For comparing proportions between two groups, where \( \hat{p} = \frac{x_1 + x_2}{n_1 + n_2} \).
In hypothesis testing, we consider two types of error: a Type I error (rejecting a true null hypothesis, with probability \( \alpha \)) and a Type II error (failing to reject a false null hypothesis, with probability \( \beta \)).
In exit polls, we typically set α = 0.05, meaning we accept a 5% chance of incorrectly concluding a difference exists.
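A minimal sketch of the two-proportion test above, with made-up counts for two demographic groups:
# Two-proportion z-test for a difference in support between two groups
import math
from scipy import stats
x1, n1 = 230, 500  # hypothetical supporters and respondents in group 1
x2, n2 = 190, 480  # hypothetical supporters and respondents in group 2
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)
z = (p1 - p2) / math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
print(f"z = {z:.3f}, two-sided p-value = {p_value:.4f}")
print("Significant at alpha = 0.05" if p_value <= 0.05 else "Not significant at alpha = 0.05")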
Understanding typical voting patterns using measures of central tendency.
# Central Tendency Analysis
import numpy as np
from scipy import stats
# Sample data: vote percentages for a party across constituencies
vote_percentages = [45, 52, 38, 48, 55, 42, 47, 51, 44, 49]
print("Vote Distribution Analysis for Party A")
print("=" * 40)
# Arithmetic Mean
mean = np.mean(vote_percentages)
print(f"Arithmetic Mean: {mean:.2f}%")
# Median
median = np.median(vote_percentages)
print(f"Median: {median:.2f}%")
# Mode (every value occurs only once here, so the mode is not very informative)
values, counts = np.unique(vote_percentages, return_counts=True)
mode_value = values[np.argmax(counts)]
print(f"Mode: {mode_value:.2f}% (appeared {counts.max()} times)")
# Geometric Mean (useful for proportional data)
geometric_mean = stats.gmean(vote_percentages)
print(f"Geometric Mean: {geometric_mean:.2f}%")
# Harmonic Mean (useful for rates)
harmonic_mean = stats.hmean(vote_percentages)
print(f"Harmonic Mean: {harmonic_mean:.2f}%")
# Output explanation (values hard-coded to match the results computed above)
print("\nInterpretation: The arithmetic mean (47.10%) is slightly higher than")
print("the geometric mean (46.84%) and harmonic mean (46.60%), as the")
print("AM >= GM >= HM inequality guarantees for positive, non-constant data.")
print("The median (47.50%) is close to the mean, suggesting a relatively")
print("symmetric distribution.")
The arithmetic mean is calculated as:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Where \(x_i\) represents each data point and \(n\) is the number of observations.
For our data: [45, 52, 38, 48, 55, 42, 47, 51, 44, 49]
\[ \bar{x} = \frac{45 + 52 + 38 + 48 + 55 + 42 + 47 + 51 + 44 + 49}{10} = \frac{471}{10} = 47.1 \]
The geometric mean is calculated as:
\[ G = \sqrt[n]{\prod_{i=1}^{n} x_i} \]
For our data:
\[ G = \sqrt[10]{45 \times 52 \times 38 \times 48 \times 55 \times 42 \times 47 \times 51 \times 44 \times 49} \]
\[ G \approx \sqrt[10]{5.1 \times 10^{16}} \approx 46.84 \]
The geometric mean is useful for proportional data as it is less affected by extreme values.
The harmonic mean is calculated as:
\[ H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} \]
For our data:
\[ H = \frac{10}{\frac{1}{45} + \frac{1}{52} + \frac{1}{38} + \frac{1}{48} + \frac{1}{55} + \frac{1}{42} + \frac{1}{47} + \frac{1}{51} + \frac{1}{44} + \frac{1}{49}} \]
\[ H \approx \frac{10}{0.2146} \approx 46.60 \]
The harmonic mean is appropriate for averaging rates because it gives equal weight to each data point.
The relationship between the different means tells us about the distribution of our data:
\[ \text{Arithmetic Mean} > \text{Geometric Mean} > \text{Harmonic Mean} \]
This ordering (the AM-GM-HM inequality) always holds for positive data with any variability, so by itself it does not indicate skewness; the gaps between the means simply grow with the spread of the data.
The close proximity of the median (47.50) to the arithmetic mean (47.10) suggests the distribution is roughly symmetric.
Analyzing vote consistency across regions using measures of variability.
# Measures of Dispersion
import numpy as np
# Sample data: vote percentages for a party across constituencies
vote_percentages = [45, 52, 38, 48, 55, 42, 47, 51, 44, 49]
print("Dispersion Analysis for Party A Votes")
print("=" * 40)
# Variance
variance = np.var(vote_percentages)
print(f"Variance: {variance:.2f}")
# Standard Deviation
std_dev = np.std(vote_percentages)
print(f"Standard Deviation: {std_dev:.2f}%")
# Range
data_range = np.ptp(vote_percentages) # Peak to peak (max - min)
print(f"Range: {data_range}%")
# Interquartile Range (IQR)
q75, q25 = np.percentile(vote_percentages, [75, 25])
iqr = q75 - q25
print(f"Interquartile Range (IQR): {iqr:.2f}%")
# Output explanation
print(f"\nInterpretation: The standard deviation of {std_dev:.2f}% indicates")
print(f"moderate variability in vote percentages across polling stations.")
print(f"The IQR of {iqr:.2f}% shows that the middle 50% of polling stations")
print(f"have vote percentages between {q25:.2f}% and {q75:.2f}%.")
Variance measures the average squared deviation from the mean:
\[ \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} \]
Where \(x_i\) represents each data point, \(\bar{x}\) is the mean, and \(n\) is the number of observations.
For our data with mean = 47.1:
\[ \sigma^2 = \frac{(45-47.1)^2 + (52-47.1)^2 + \cdots + (49-47.1)^2}{10} \]
\[ \sigma^2 = \frac{(-2.1)^2 + (4.9)^2 + (-9.1)^2 + (0.9)^2 + (7.9)^2 + (-5.1)^2 + (-0.1)^2 + (3.9)^2 + (-3.1)^2 + (1.9)^2}{10} \]
\[ \sigma^2 = \frac{4.41 + 24.01 + 82.81 + 0.81 + 62.41 + 26.01 + 0.01 + 15.21 + 9.61 + 3.61}{10} = \frac{228.9}{10} = 22.89 \]
Standard deviation is the square root of variance:
\[ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \]
For our data:
\[ \sigma = \sqrt{22.89} \approx 4.78 \]
This tells us that vote percentages typically vary by about 4.78% from the mean value.
IQR measures the spread of the middle 50% of data:
\[ \text{IQR} = Q_3 - Q_1 \]
Where \(Q_1\) is the 25th percentile and \(Q_3\) is the 75th percentile.
For our sorted data: [38, 42, 44, 45, 47, 48, 49, 51, 52, 55]
\[ Q_1 = 44.25 \quad (\text{using linear interpolation}) \]
\[ Q_3 = 50.5 \quad (\text{using linear interpolation}) \]
\[ \text{IQR} = 50.5 - 44.25 = 6.25 \]
This means the middle 50% of polling stations have vote percentages within a range of 6.25%.
The standard deviation of about 4.78% indicates moderate variability in vote percentages across polling stations.
The IQR of 6.25% tells us that half of all polling stations have vote percentages between 44.25% and 50.5%, which is a relatively narrow range, indicating consistency in most regions.
Analyzing relationship between income levels and voting patterns.
# Correlation Analysis
import numpy as np
# Sample data: income (in thousands) and vote percentage for a party
income = [35, 42, 28, 55, 62, 38, 45, 51, 33, 48]
vote_percent = [45, 52, 38, 48, 55, 42, 47, 51, 44, 49]
print("Correlation between Income and Vote Percentage")
print("=" * 55)
# Covariance
covariance = np.cov(income, vote_percent)[0, 1]
print(f"Covariance: {covariance:.2f}")
# Pearson Correlation Coefficient
correlation = np.corrcoef(income, vote_percent)[0, 1]
print(f"Pearson's r: {correlation:.3f}")
# Interpretation
if correlation > 0.7:
strength = "strong positive"
elif correlation > 0.3:
strength = "moderate positive"
elif correlation > -0.3:
strength = "weak or no"
elif correlation > -0.7:
strength = "moderate negative"
else:
strength = "strong negative"
print(f"\nInterpretation: {strength} correlation between income and vote percentage.")
# Additional insights
if correlation > 0:
print("As income increases, vote percentage for Party A tends to increase.")
else:
print("As income increases, vote percentage for Party A tends to decrease.")
Covariance measures how two variables change together:
\[ \text{Cov}(X,Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n} \]
Where \(x_i\) and \(y_i\) are data points, \(\bar{x}\) and \(\bar{y}\) are means.
For our data:
\[ \bar{x} = 43.7 \quad (\text{mean income}) \]
\[ \bar{y} = 47.1 \quad (\text{mean vote percentage}) \]
\[ \text{Cov}(X,Y) = \frac{(35-43.7)(45-47.1) + (42-43.7)(52-47.1) + \cdots + (48-43.7)(49-47.1)}{10} \]
\[ \text{Cov}(X,Y) = \frac{(-8.7)(-2.1) + (-1.7)(4.9) + \cdots + (4.3)(1.9)}{10} \]
\[ \text{Cov}(X,Y) = \frac{18.27 - 8.33 + \cdots + 8.17}{10} = \frac{406.3}{10} = 40.63 \]
(NumPy's np.cov divides by \( n-1 \) rather than \( n \), so the code above reports about 45.1; the correlation coefficient below is identical under either convention.)
Pearson's r standardizes covariance to a range between -1 and 1:
\[ r = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y} \]
Where \(\sigma_X\) and \(\sigma_Y\) are standard deviations of X and Y.
For our data:
\[ \sigma_X = 10.04 \quad (\text{std dev of income}) \]
\[ \sigma_Y = 4.78 \quad (\text{std dev of vote percentage}) \]
\[ r = \frac{40.63}{10.04 \times 4.78} = \frac{40.63}{47.99} \approx 0.85 \]
This indicates a strong positive correlation between income and vote percentage in this illustrative sample.
Degrees of freedom (df) in correlation analysis represent the number of independent pieces of information available to estimate the relationship between variables.
For Pearson correlation, degrees of freedom is calculated as:
\[ df = n - 2 \]
Where \(n\) is the number of paired observations.
In our case with 10 data points:
\[ df = 10 - 2 = 8 \]
We subtract 2 because we've estimated two parameters from the data (the means of X and Y). These estimated parameters place constraints on the data, reducing the number of independent pieces of information.
Degrees of freedom are crucial for determining the statistical significance of the correlation coefficient and for calculating confidence intervals.
The correlation coefficient (r ≈ 0.85) indicates a strong positive relationship in this small illustrative sample: higher-income constituencies tend to show higher vote percentages for Party A.
Bear in mind that an estimate based on only 10 observations is unstable; in real exit poll data income alone is rarely this predictive, and other factors (age, education, geographic location) also play significant roles in determining voting patterns.
To determine if this correlation is statistically significant, we can calculate the t-statistic:
\[ t = r \sqrt{\frac{n-2}{1-r^2}} \]
Where n is the sample size (10 in our case).
\[ t = 0.85 \times \sqrt{\frac{8}{1-0.72}} = 0.85 \times \sqrt{\frac{8}{0.28}} \approx 0.85 \times 5.35 \approx 4.5 \]
With 8 degrees of freedom, this t-value is statistically significant (p < 0.01), so we reject the null hypothesis of no correlation, keeping in mind that the sample here is very small.
Multivariate analysis of polling data using matrix operations.
# Matrix Operations for Multivariate Analysis
import numpy as np
# Create a data matrix: rows = constituencies, columns = variables
# Variables: vote percentage, median income, median age, education index
data_matrix = np.array([
[45, 35, 42, 0.65], # Constituency 1
[52, 42, 38, 0.72], # Constituency 2
[38, 28, 51, 0.58], # Constituency 3
[48, 55, 45, 0.81], # Constituency 4
[55, 62, 39, 0.78] # Constituency 5
])
print("Data Matrix (5 constituencies × 4 variables):")
print(data_matrix)
# Row operation: Normalize each row (constituency) by its total
row_sums = data_matrix.sum(axis=1)
normalized_by_row = data_matrix / row_sums[:, np.newaxis]
print("\nRow-normalized Matrix (each row sums to 1):")
print(normalized_by_row)
# Column operation: Center the data by subtracting column means
column_means = np.mean(data_matrix, axis=0)
centered_data = data_matrix - column_means
print("\nColumn-centered Matrix (each column mean = 0):")
print(centered_data)
# Calculate covariance matrix
covariance_matrix = np.cov(centered_data, rowvar=False)
print("\nCovariance Matrix:")
print(covariance_matrix)
# Calculate correlation matrix
correlation_matrix = np.corrcoef(centered_data, rowvar=False)
print("\nCorrelation Matrix:")
print(correlation_matrix)
# Interpretation
print("\nInterpretation: The covariance matrix shows how variables vary together.")
print("The correlation matrix shows standardized relationships between variables.")
print("Values close to 1 or -1 indicate strong relationships.")
Our data matrix represents 5 constituencies with 4 variables each:
| 45 | 35 | 42 | 0.65 |
| 52 | 42 | 38 | 0.72 |
| 38 | 28 | 51 | 0.58 |
| 48 | 55 | 45 | 0.81 |
| 55 | 62 | 39 | 0.78 |
This matrix format allows us to perform efficient multivariate analysis.
Row normalization converts each row to sum to 1:
\[ \text{For each row } i, \quad x_{ij}^{\text{norm}} = \frac{x_{ij}}{\sum_{j=1}^{p} x_{ij}} \]
This is useful for comparing patterns across constituencies with different sizes.
For the first row: [45, 35, 42, 0.65] with sum = 122.65
Normalized: [45/122.65, 35/122.65, 42/122.65, 0.65/122.65] ≈ [0.367, 0.285, 0.342, 0.005]
Column centering subtracts the column mean from each value:
\[ x_{ij}^{\text{centered}} = x_{ij} - \bar{x}_j \]
Where \(\bar{x}_j\) is the mean of column j.
This transformation is essential for covariance and correlation calculations.
The covariance matrix is calculated as:
\[ \Sigma = \frac{1}{n-1} X^T X \]
Where X is the centered data matrix and n is the number of observations.
This matrix shows how variables vary together. Diagonal elements are variances, and off-diagonal elements are covariances.
For our centered data, the covariance matrix would be:
| Var(X₁) | Cov(X₁,X₂) | Cov(X₁,X₃) | Cov(X₁,X₄) |
| Cov(X₂,X₁) | Var(X₂) | Cov(X₂,X₃) | Cov(X₂,X₄) |
| Cov(X₃,X₁) | Cov(X₃,X₂) | Var(X₃) | Cov(X₃,X₄) |
| Cov(X₄,X₁) | Cov(X₄,X₂) | Cov(X₄,X₃) | Var(X₄) |
The correlation matrix is derived from the covariance matrix:
\[ \rho_{ij} = \frac{\sigma_{ij}}{\sigma_i \sigma_j} \]
Where \(\sigma_{ij}\) is the covariance between variables i and j, and \(\sigma_i\), \(\sigma_j\) are their standard deviations.
Correlation values range from -1 to 1, indicating the strength and direction of relationships.
For example, if we have:
\[ \sigma_{12} = 25.5 \quad (\text{covariance between vote % and income}) \]
\[ \sigma_1 = 6.8 \quad (\text{std dev of vote %}) \]
\[ \sigma_2 = 12.3 \quad (\text{std dev of income}) \]
Then the correlation would be:
\[ \rho_{12} = \frac{25.5}{6.8 \times 12.3} \approx \frac{25.5}{83.64} \approx 0.305 \]
This indicates a moderate positive correlation between vote percentage and income.
The correlation matrix standardizes the covariance matrix, making it easier to compare relationships between variables with different scales.
Analyzing relationship between education level and voting preference.
# Cross-Tabulation and Chi-Square Test
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
# Sample data: education level (1=Low, 2=Medium, 3=High) and vote choice (1=Party A, 2=Party B)
education_level = [1, 2, 3, 2, 3, 1, 2, 3, 3, 2,
1, 2, 3, 2, 1, 3, 2, 3, 1, 2]
vote_choice = [1, 2, 2, 1, 2, 1, 1, 2, 2, 2,
1, 1, 2, 2, 1, 2, 1, 2, 1, 2]
print("Cross-Tabulation of Education Level and Vote Choice")
print("=" * 55)
# Create a cross-tabulation (contingency table)
contingency_table = pd.crosstab(education_level, vote_choice,
rownames=['Education Level'],
colnames=['Party'])
print("Contingency Table:")
print(contingency_table)
# Perform Chi-Square test
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
print(f"\nChi-Square Test Results:")
print(f"Chi2 statistic: {chi2:.3f}")
print(f"P-value: {p_value:.6f}")
print(f"Degrees of freedom: {dof}")
print("Expected frequencies: \n", expected)
# Interpret results
alpha = 0.05
if p_value <= alpha:
print("\nThere is a significant relationship between education level and vote choice.")
else:
print("\nThere is no significant relationship between education level and vote choice.")
# Calculate Cramer's V for effect size
n = np.sum(contingency_table.values)
min_dim = min(contingency_table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))
print(f"\nEffect size (Cramer's V): {cramers_v:.3f}")
if cramers_v < 0.1:
effect_strength = "weak"
elif cramers_v < 0.3:
effect_strength = "moderate"
else:
effect_strength = "strong"
print(f"This indicates a {effect_strength} relationship between education level and voting preference.")
A contingency table shows the frequency distribution of variables:
| Education Level | Party A | Party B | Total |
|---|---|---|---|
| Low | 5 | 0 | 5 |
| Medium | 4 | 4 | 8 |
| High | 0 | 7 | 7 |
| Total | 9 | 11 | 20 |
This table shows the relationship between education level and voting preference.
The Chi-Square test determines if there's a significant association between categorical variables:
\[ \chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]
Where \(O_{ij}\) is the observed frequency and \(E_{ij}\) is the expected frequency under the null hypothesis of no association.
Expected frequencies are calculated as:
\[ E_{ij} = \frac{(\text{row total}_i) \times (\text{column total}_j)}{n} \]
For example, for Low Education and Party A:
\[ E_{11} = \frac{5 \times 9}{20} = \frac{45}{20} = 2.25 \]
These values represent what we would expect if there was no relationship between education and voting preference.
Cramer's V measures the strength of association between nominal variables:
\[ V = \sqrt{\frac{\chi^2}{n \times (k - 1)}} \]
Where n is the total sample size and k is the number of rows or columns, whichever is smaller.
Values range from 0 (no association) to 1 (perfect association).
In our example, the test gives \( \chi^2 \approx 11.9 \) with 2 degrees of freedom (p ≈ 0.003) and Cramer's V ≈ 0.77.
This suggests a strong, statistically significant association between education level and voting preference in this illustrative sample, although several expected cell counts are below 5, so the chi-square approximation should be interpreted cautiously at this sample size.
Apply machine learning algorithms to predict election outcomes based on exit poll data and demographic factors.
We use advanced machine learning models to predict election outcomes based on exit poll data.
We employ several predictive modeling techniques:
Logistic Regression: \[ P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1X_1 + \cdots + \beta_nX_n)}} \]
Good for binary classification problems.
Random Forest: an ensemble method combining multiple decision trees. Reduces overfitting and improves accuracy.
Gradient Boosting: sequentially builds models to correct the errors of previous models. High predictive accuracy.
Neural Networks: deep learning models for complex pattern recognition. Can capture nonlinear relationships.
We use various metrics to evaluate model performance:
Accuracy: \[ \frac{TP + TN}{TP + TN + FP + FN} \]
Precision: \[ \frac{TP}{TP + FP} \]
Recall: \[ \frac{TP}{TP + FN} \]
F1-Score: \[ 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} \]
Where TP = True Positives, TN = True Negatives, FP = False Positives, FN = False Negatives
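A short sketch computing these metrics for a set of hypothetical predictions:
# Accuracy, precision, recall and F1 from predicted vs actual outcomes
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # hypothetical actual outcomes (1 = win)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]  # hypothetical model predictions
print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")
print(f"Precision: {precision_score(y_true, y_pred):.2f}")
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")
print(f"F1-score:  {f1_score(y_true, y_pred):.2f}")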
We analyze which factors most influence voting behavior using feature importance scores from tree-based models and the coefficients of regression models.
For tracking changes in voter preferences over time:
ARIMA Model: \[ \Delta^d y_t = c + \phi_1 \Delta^d y_{t-1} + \cdots + \phi_p \Delta^d y_{t-p} + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} + \varepsilon_t \]
Where ARIMA(p,d,q) represents the order of the autoregressive, integrated, and moving average parts
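A minimal sketch with statsmodels and a short, made-up vote-share series (a real application would use many more observations):
# Fitting a small ARIMA(1, 1, 1) model to a hypothetical vote-share time series
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
vote_share_series = np.array([41.2, 42.5, 43.1, 42.8, 44.0, 44.6, 45.1, 44.8,
                              45.9, 46.3, 46.0, 47.2, 47.8, 48.1])
model = ARIMA(vote_share_series, order=(1, 1, 1))  # (p, d, q)
fitted = model.fit()
print(fitted.summary())
print("Next-period forecast:", fitted.forecast(steps=1))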
We combine predictions from multiple models to improve accuracy:
Weighted Average: \[ \hat{y} = \sum_{i=1}^{m} w_i \hat{y}_i \]
Where \( w_i \) are weights assigned to each model's prediction
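A small sketch of this weighted combination, with hypothetical model outputs and weights:
# Weighted-average ensemble of vote-share predictions from several models
import numpy as np
predictions = np.array([46.2, 44.8, 47.5])  # hypothetical predictions from three models (%)
weights = np.array([0.5, 0.3, 0.2])  # weights, e.g. based on validation performance
ensemble_prediction = np.sum(weights * predictions) / np.sum(weights)
print(f"Ensemble vote-share prediction: {ensemble_prediction:.2f}%")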
We use k-fold cross-validation to assess model performance:
\[ CV(k) = \frac{1}{k} \sum_{i=1}^{k} MSE_i \]
Where MSE is the mean squared error for each fold.
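A brief sketch of k-fold cross-validation with scikit-learn, using synthetic data and 5 folds:
# k-fold cross-validation of a regression model using mean squared error
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))  # synthetic features
y = X @ np.array([0.5, -0.2, 0.3]) + rng.normal(scale=0.1, size=100)
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='neg_mean_squared_error')
print(f"MSE per fold: {-scores}")
print(f"CV(5) estimate (mean MSE): {-scores.mean():.4f}")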
Regression models predict continuous values like vote percentage or seat count based on input features.
# Linear Regression for Vote Share Prediction
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
# Sample data: demographic features and vote share
data = {
'income': [35, 42, 28, 55, 62, 38, 45, 51, 33, 48, 29, 36, 41, 53, 59],
'education': [12, 16, 10, 18, 20, 14, 15, 17, 11, 19, 10, 13, 16, 18, 20],
'age': [42, 35, 51, 45, 39, 48, 42, 36, 54, 41, 49, 38, 43, 47, 40],
'previous_vote': [45, 52, 38, 48, 55, 42, 47, 51, 44, 49, 40, 46, 50, 53, 56],
'vote_share': [48, 55, 42, 52, 58, 45, 50, 54, 46, 53, 43, 49, 52, 56, 59]
}
df = pd.DataFrame(data)
# Prepare features and target
X = df[['income', 'education', 'age', 'previous_vote']]
y = df['vote_share']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Linear Regression Results:")
print("==========================")
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")
# Predict for new data
new_data = pd.DataFrame({
'income': [40, 50],
'education': [15, 18],
'age': [45, 42],
'previous_vote': [47, 52]
})
predictions = model.predict(new_data)
print(f"\nPredictions for new data: {predictions}")
# Plot actual vs predicted
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.7)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
plt.xlabel('Actual Vote Share')
plt.ylabel('Predicted Vote Share')
plt.title('Linear Regression: Actual vs Predicted Vote Share')
plt.show()
Linear regression models the relationship between a dependent variable and one or more independent variables:
\[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_nx_n + \epsilon \]
Where \( y \) is the dependent variable (e.g., vote share), \( x_1, \ldots, x_n \) are the predictor variables, \( \beta_0, \ldots, \beta_n \) are the coefficients to be estimated, and \( \epsilon \) is the error term.
The coefficients are estimated by minimizing the sum of squared residuals:
\[ \min_{\beta} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
Where \( \hat{y}_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \cdots + \beta_nx_{in} \)
The solution is given by:
\[ \hat{\beta} = (X^T X)^{-1} X^T y \]
Where \( X \) is the design matrix and \( y \) is the response vector.
Mean Squared Error (MSE):
\[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
R-squared (Coefficient of Determination):
\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \]
Where \( \bar{y} \) is the mean of the observed data.
In our example, the printed coefficients, mean squared error, and R² summarize how well the demographic features explain vote share on the held-out test data.
For election forecasting, we might find, for instance, that previous vote share is the strongest predictor, with income and education contributing smaller adjustments.
Linear regression has few hyperparameters to tune: essentially only which features to include and whether to fit an intercept.
# Polynomial Regression for Vote Share Prediction
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
# Sample data
data = {
'campaign_spending': [10, 15, 8, 20, 25, 12, 18, 22, 9, 28, 11, 16, 19, 24, 30],
'vote_share': [45, 52, 42, 58, 62, 47, 55, 59, 43, 65, 44, 53, 56, 61, 68]
}
df = pd.DataFrame(data)
# Prepare features and target
X = df[['campaign_spending']]
y = df['vote_share']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create polynomial regression model
degree = 3
poly_model = Pipeline([
('poly', PolynomialFeatures(degree=degree)),
('linear', LinearRegression())
])
# Train the model
poly_model.fit(X_train, y_train)
# Make predictions
y_pred = poly_model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Polynomial Regression Results:")
print("==============================")
print(f"Degree: {degree}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")
# Create a range of values for plotting
X_range = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
y_range_pred = poly_model.predict(X_range)
# Plot results
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.7, label='Actual Data')
plt.plot(X_range, y_range_pred, 'r-', label=f'Polynomial (Degree {degree})')
plt.xlabel('Campaign Spending (in lakhs)')
plt.ylabel('Vote Share (%)')
plt.title('Polynomial Regression: Campaign Spending vs Vote Share')
plt.legend()
plt.show()
Polynomial regression models the relationship as an nth degree polynomial:
\[ y = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + \cdots + \beta_nx^n + \epsilon \]
This is still a linear model because it's linear in the parameters \( \beta_i \).
Polynomial regression uses basis expansion to transform the features:
\[ \phi(x) = [1, x, x^2, x^3, \ldots, x^n] \]
The model then becomes:
\[ y = \beta_0 + \beta_1\phi_1(x) + \beta_2\phi_2(x) + \cdots + \beta_n\phi_n(x) + \epsilon \]
This allows us to fit nonlinear relationships while still using linear regression techniques.
The degree of the polynomial is a hyperparameter: too low a degree underfits the data, while too high a degree overfits it.
We can use cross-validation to select the optimal degree.
Polynomial regression is useful when relationships are nonlinear:
For example, campaign spending might have increasing returns at first but diminishing returns after a certain point.
# Ridge Regression for Vote Share Prediction
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
# Sample data with multiple features
data = {
'income': [35, 42, 28, 55, 62, 38, 45, 51, 33, 48, 29, 36, 41, 53, 59],
'education': [12, 16, 10, 18, 20, 14, 15, 17, 11, 19, 10, 13, 16, 18, 20],
'age': [42, 35, 51, 45, 39, 48, 42, 36, 54, 41, 49, 38, 43, 47, 40],
'previous_vote': [45, 52, 38, 48, 55, 42, 47, 51, 44, 49, 40, 46, 50, 53, 56],
'campaign_spending': [10, 15, 8, 20, 25, 12, 18, 22, 9, 28, 11, 16, 19, 24, 30],
'vote_share': [48, 55, 42, 52, 58, 45, 50, 54, 46, 53, 43, 49, 52, 56, 59]
}
df = pd.DataFrame(data)
# Prepare features and target
X = df.drop('vote_share', axis=1)
y = df['vote_share']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Create and train Ridge regression model
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train_scaled, y_train)
# Make predictions
y_pred = ridge_model.predict(X_test_scaled)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Ridge Regression Results:")
print("========================")
print(f"Alpha: {ridge_model.alpha}")
print(f"Coefficients: {ridge_model.coef_}")
print(f"Intercept: {ridge_model.intercept_:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")
# Hyperparameter tuning with GridSearchCV
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
grid_search = GridSearchCV(Ridge(), param_grid, cv=5, scoring='r2')
grid_search.fit(X_train_scaled, y_train)
print(f"\nBest alpha: {grid_search.best_params_['alpha']}")
print(f"Best R² score: {grid_search.best_score_:.2f}")
Ridge regression adds L2 regularization to the linear regression cost function:
\[ \min_{\beta} \left( \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} \beta_j^2 \right) \]
Where \( \alpha \geq 0 \) is the regularization strength, \( \beta_j \) are the model coefficients, and \( p \) is the number of features.
The solution is given by:
\[ \hat{\beta} = (X^T X + \alpha I)^{-1} X^T y \]
Where \( I \) is the identity matrix.
Ridge regression shrinks the coefficients toward zero (without setting them exactly to zero), reducing variance at the cost of a small bias and stabilizing the estimates when predictors are correlated.
The regularization parameter \( \alpha \) controls the trade-off: as \( \alpha \to 0 \) the solution approaches ordinary least squares, while larger \( \alpha \) shrinks the coefficients more, increasing bias but reducing variance.
We can use cross-validation to find the optimal value of \( \alpha \).
Ridge regression is useful when predictors are highly correlated (multicollinearity) or when the number of features is large relative to the number of observations.
For example, income and education levels are often correlated, and ridge regression can handle this multicollinearity better than ordinary least squares.
# Lasso Regression for Vote Share Prediction
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
# Sample data with multiple features
data = {
'income': [35, 42, 28, 55, 62, 38, 45, 51, 33, 48, 29, 36, 41, 53, 59],
'education': [12, 16, 10, 18, 20, 14, 15, 17, 11, 19, 10, 13, 16, 18, 20],
'age': [42, 35, 51, 45, 39, 48, 42, 36, 54, 41, 49, 38, 43, 47, 40],
'previous_vote': [45, 52, 38, 48, 55, 42, 47, 51, 44, 49, 40, 46, 50, 53, 56],
'campaign_spending': [10, 15, 8, 20, 25, 12, 18, 22, 9, 28, 11, 16, 19, 24, 30],
'social_media_presence': [2, 5, 1, 7, 9, 3, 6, 8, 2, 10, 3, 4, 6, 8, 10],
'vote_share': [48, 55, 42, 52, 58, 45, 50, 54, 46, 53, 43, 49, 52, 56, 59]
}
df = pd.DataFrame(data)
# Prepare features and target
X = df.drop('vote_share', axis=1)
y = df['vote_share']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Create and train Lasso regression model
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train_scaled, y_train)
# Make predictions
y_pred = lasso_model.predict(X_test_scaled)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Lasso Regression Results:")
print("========================")
print(f"Alpha: {lasso_model.alpha}")
print(f"Coefficients: {lasso_model.coef_}")
print(f"Intercept: {lasso_model.intercept_:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")
# Check which features were selected (non-zero coefficients)
feature_names = X.columns
selected_features = feature_names[lasso_model.coef_ != 0]
print(f"\nSelected features: {list(selected_features)}")
# Hyperparameter tuning with GridSearchCV
param_grid = {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0]}
grid_search = GridSearchCV(Lasso(), param_grid, cv=5, scoring='r2')
grid_search.fit(X_train_scaled, y_train)
print(f"\nBest alpha: {grid_search.best_params_['alpha']}")
print(f"Best R² score: {grid_search.best_score_:.2f}")
Lasso regression adds L1 regularization to the linear regression cost function:
\[ \min_{\beta} \left( \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} |\beta_j| \right) \]
Where \( \alpha \geq 0 \) is the regularization strength and \( \sum_j |\beta_j| \) is the L1 penalty on the coefficients.
Lasso regression has the special property that it can shrink some coefficients to exactly zero:
This is particularly useful when we have many features and want to identify which ones are most predictive.
Similar to ridge regression, we need to choose the regularization parameter \( \alpha \): larger values zero out more coefficients, while smaller values retain more of them.
Cross-validation is used to find the optimal value of \( \alpha \).
Lasso regression is useful when there are many candidate features and we expect only a subset of them to matter, so automatic feature selection is desirable.
For example, we might start with 20+ demographic and political features, and lasso can help us identify the 5-10 most predictive features for vote share.
# Gradient Descent for Linear Regression
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Sample data
data = {
'income': [35, 42, 28, 55, 62, 38, 45, 51, 33, 48],
'education': [12, 16, 10, 18, 20, 14, 15, 17, 11, 19],
'vote_share': [48, 55, 42, 52, 58, 45, 50, 54, 46, 53]
}
df = pd.DataFrame(data)
# Prepare features and target
X = df[['income', 'education']].values
# Add intercept term (column of ones)
X = np.c_[np.ones(X.shape[0]), X]
y = df['vote_share'].values
# Initialize parameters
theta = np.zeros(X.shape[1])
alpha = 0.01 # Learning rate
iterations = 1000
m = len(y) # Number of training examples
# Cost history to track progress
cost_history = np.zeros(iterations)
# Gradient Descent
for i in range(iterations):
# Calculate predictions
predictions = X.dot(theta)
# Calculate errors
errors = predictions - y
# Calculate gradient
gradient = (1/m) * X.T.dot(errors)
# Update parameters
theta = theta - alpha * gradient
# Calculate cost (MSE)
cost = (1/(2*m)) * np.sum(errors**2)
cost_history[i] = cost
print("Gradient Descent Results:")
print("========================")
print(f"Final parameters: {theta}")
print(f"Final cost: {cost_history[-1]:.4f}")
# Plot cost history
plt.figure(figsize=(10, 6))
plt.plot(range(iterations), cost_history)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.title('Gradient Descent: Cost vs Iterations')
plt.show()
# Make predictions
new_data = np.array([[1, 40, 15], [1, 50, 18]]) # Note the intercept term
predictions = new_data.dot(theta)
print(f"Predictions for new data: {predictions}")
Gradient descent is an optimization algorithm used to minimize the cost function:
\[ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 \]
Where \( m \) is the number of training examples, \( h_\theta(x^{(i)}) \) is the model's prediction for the i-th example, \( y^{(i)} \) is the observed value, and \( \theta \) denotes the parameters being learned.
The parameters are updated simultaneously using:
\[ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \]
Where \( \alpha \) is the learning rate.
The partial derivative is:
\[ \frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \]
The learning rate \( \alpha \) determines the step size: too small a value makes convergence slow, while too large a value can overshoot the minimum and diverge.
Gradient descent is useful when the dataset or the number of features is too large for the closed-form normal equation, whose matrix inversion becomes expensive.
# Maximum Likelihood Estimation for Linear Regression
import numpy as np
import pandas as pd
import scipy.optimize as opt
import matplotlib.pyplot as plt
# Sample data
data = {
'income': [35, 42, 28, 55, 62, 38, 45, 51, 33, 48],
'education': [12, 16, 10, 18, 20, 14, 15, 17, 11, 19],
'vote_share': [48, 55, 42, 52, 58, 45, 50, 54, 46, 53]
}
df = pd.DataFrame(data)
# Prepare features and target
X = df[['income', 'education']].values
# Add intercept term (column of ones)
X = np.c_[np.ones(X.shape[0]), X]
y = df['vote_share'].values
# Define negative log-likelihood function
def neg_log_likelihood(theta, X, y):
    """Negative log-likelihood for linear regression with normal errors.
    The last element of theta is log(sigma^2); optimizing the log keeps the
    variance positive for every step the optimizer tries."""
    m = len(y)
    # Predictions (theta[:-1] are the regression coefficients)
    y_pred = X.dot(theta[:-1])
    # Residuals
    residuals = y - y_pred
    # Variance (recovered from its log)
    sigma_sq = np.exp(theta[-1])
    # Log-likelihood of the data under the normal error model
    log_likelihood = -m/2 * np.log(2*np.pi*sigma_sq) - 1/(2*sigma_sq) * np.sum(residuals**2)
    return -log_likelihood  # Return negative for minimization
# Initial guess (coefficients + log-variance)
initial_theta = np.zeros(X.shape[1] + 1)
initial_theta[-1] = np.log(1.0)  # Initial variance of 1
# Minimize negative log-likelihood
result = opt.minimize(neg_log_likelihood, initial_theta, args=(X, y), method='BFGS')
# Extract parameters
theta_hat = result.x[:-1]  # Coefficient estimates
sigma_sq_hat = np.exp(result.x[-1])  # Variance estimate
print("Maximum Likelihood Estimation Results:")
print("=====================================")
print(f"Coefficient estimates: {theta_hat}")
print(f"Variance estimate: {sigma_sq_hat:.4f}")
print(f"Negative log-likelihood: {result.fun:.4f}")
# Compare with OLS
theta_ols = np.linalg.inv(X.T.dot(X)).dot(X.T.dot(y))
print(f"\nOLS estimates: {theta_ols}")
# Make predictions
new_data = np.array([[1, 40, 15], [1, 50, 18]]) # Note the intercept term
predictions = new_data.dot(theta_hat)
print(f"Predictions for new data: {predictions}")
Maximum likelihood estimation finds parameter values that maximize the likelihood of observing the data:
\[ \mathcal{L}(\theta; y, X) = \prod_{i=1}^{n} f(y_i | x_i; \theta) \]
Where \( f(y_i | x_i; \theta) \) is the probability density function.
For linear regression with normal errors:
\[ y_i | x_i \sim \mathcal{N}(x_i^T \beta, \sigma^2) \]
The likelihood function is:
\[ \mathcal{L}(\beta, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - x_i^T \beta)^2}{2\sigma^2}\right) \]
It's often easier to work with the log-likelihood:
\[ \ell(\beta, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 \]
Maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood.
For linear regression with normal errors, the maximum likelihood estimates are:
\[ \hat{\beta}_{MLE} = (X^T X)^{-1} X^T y \]
\[ \hat{\sigma}^2_{MLE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i^T \hat{\beta})^2 \]
Note that the MLE of \( \sigma^2 \) is biased (divides by n rather than n-p).
The normal equation solution for linear regression involves several matrix operations:
The design matrix contains the input features with an additional column of ones for the intercept:
\[ X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} \]
Where n is the number of observations and p is the number of features.
The transpose operation flips the matrix over its diagonal:
\[ X^T = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_{11} & x_{21} & \cdots & x_{n1} \\ x_{12} & x_{22} & \cdots & x_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1p} & x_{2p} & \cdots & x_{np} \end{bmatrix} \]
This converts the n×(p+1) matrix to a (p+1)×n matrix.
Multiplying Xᵀ by X gives a (p+1)×(p+1) matrix:
\[ X^T X = \begin{bmatrix} n & \sum x_{i1} & \sum x_{i2} & \cdots & \sum x_{ip} \\ \sum x_{i1} & \sum x_{i1}^2 & \sum x_{i1}x_{i2} & \cdots & \sum x_{i1}x_{ip} \\ \sum x_{i2} & \sum x_{i1}x_{i2} & \sum x_{i2}^2 & \cdots & \sum x_{i2}x_{ip} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \sum x_{ip} & \sum x_{i1}x_{ip} & \sum x_{i2}x_{ip} & \cdots & \sum x_{ip}^2 \end{bmatrix} \]
This matrix contains the sums of squares and cross-products of the features.
The inverse of XᵀX is needed to solve the normal equation:
\[ (X^T X)^{-1} \]
This matrix exists if X has full column rank (no perfect multicollinearity).
The inverse represents the precision matrix, which is related to the covariance of the parameter estimates.
Multiplying Xᵀ by the response vector y gives a (p+1)×1 vector:
\[ X^T y = \begin{bmatrix} \sum y_i \\ \sum x_{i1} y_i \\ \sum x_{i2} y_i \\ \vdots \\ \sum x_{ip} y_i \end{bmatrix} \]
This vector contains the sums of cross-products between features and the response.
The normal equation solution is obtained by multiplying (XᵀX)⁻¹ by Xᵀy:
\[ \hat{\beta} = (X^T X)^{-1} X^T y \]
This gives the parameter estimates that minimize the sum of squared errors.
The variance-covariance matrix of the estimates is:
\[ \text{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1} \]
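The whole sequence above can be traced with a few lines of NumPy; the small design matrix below is purely illustrative:
# Normal-equation solution and variance-covariance matrix of the estimates
import numpy as np
X = np.array([[1, 35, 12],
              [1, 42, 16],
              [1, 28, 10],
              [1, 55, 18],
              [1, 62, 20]], dtype=float)  # intercept, income, education
y = np.array([48, 55, 42, 52, 58], dtype=float)  # vote share
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y  # (X'X)^{-1} X'y
residuals = y - X @ beta_hat
n, p = X.shape
sigma_sq_hat = residuals @ residuals / (n - p)  # unbiased estimate of sigma^2
var_beta = sigma_sq_hat * XtX_inv  # Var(beta_hat) = sigma^2 (X'X)^{-1}
print("beta_hat:", beta_hat)
print("Var(beta_hat):\n", var_beta)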
Classification algorithms predict categorical outcomes like win/lose or party affiliation based on input features.
# Logistic Regression for Election Outcome Prediction
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
# Sample data: demographic features and election outcome (1 = win, 0 = lose)
data = {
'income': [35, 42, 28, 55, 62, 38, 45, 51, 33, 48, 29, 36, 41, 53, 59],
'education': [12, 16, 10, 18, 20, 14, 15, 17, 11, 19, 10, 13, 16, 18, 20],
'age': [42, 35, 51, 45, 39, 48, 42, 36, 54, 41, 49, 38, 43, 47, 40],
'previous_vote': [45, 52, 38, 48, 55, 42, 47, 51, 44, 49, 40, 46, 50, 53, 56],
'outcome': [1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1]
}
df = pd.DataFrame(data)
# Prepare features and target
X = df.drop('outcome', axis=1)
y = df['outcome']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Create and train logistic regression model
logistic_model = LogisticRegression()
logistic_model.fit(X_train_scaled, y_train)
# Make predictions
y_pred = logistic_model.predict(X_test_scaled)
y_pred_proba = logistic_model.predict_proba(X_test_scaled)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print("Logistic Regression Results:")
print("============================")
print(f"Accuracy: {accuracy:.2f}")
print(f"Coefficients: {logistic_model.coef_}")
print(f"Intercept: {logistic_model.intercept_}")
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)
# Predict probabilities for new data
new_data = pd.DataFrame({
'income': [40, 50],
'education': [15, 18],
'age': [45, 42],
'previous_vote': [47, 52]
})
new_data_scaled = scaler.transform(new_data)
predictions = logistic_model.predict_proba(new_data_scaled)
print(f"\nPrediction probabilities for new data: {predictions[:, 1]}")
Logistic regression models the probability that an instance belongs to a particular class:
\[ P(y=1|x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_nx_n)}} \]
Where \( P(y=1|x) \) is the probability of the positive outcome (e.g., winning the seat), \( x_1, \ldots, x_n \) are the predictor variables, and \( \beta_0, \ldots, \beta_n \) are the coefficients estimated from the data.
We can transform the probability to log-odds:
\[ \log\left(\frac{P(y=1|x)}{1 - P(y=1|x)}\right) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_nx_n \]
This means the coefficients represent the change in log-odds for a one-unit change in the predictor.
Logistic regression parameters are estimated using maximum likelihood estimation:
\[ \mathcal{L}(\beta) = \prod_{i=1}^{n} P(y_i|x_i)^{y_i} (1 - P(y_i|x_i))^{1-y_i} \]
We maximize the log-likelihood:
\[ \log\mathcal{L}(\beta) = \sum_{i=1}^{n} \left[ y_i \log P(y_i|x_i) + (1-y_i) \log (1 - P(y_i|x_i)) \right] \]
Logistic regression is useful for modeling binary outcomes, such as whether a party wins or loses a constituency.
The predicted probabilities can be interpreted as the likelihood of winning, which is more informative than a simple win/lose prediction.
# Random Forest for Election Outcome Prediction
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
# Sample data: demographic features and election outcome (1 = win, 0 = lose)
data = {
'income': [35, 42, 28, 55, 62, 38, 45, 51, 33, 48, 29, 36, 41, 53, 59],
'education': [12, 16, 10, 18, 20, 14, 15, 17, 11, 19, 10, 13, 16, 18, 20],
'age': [42, 35, 51, 45, 39, 48, 42, 36, 54, 41, 49, 38, 43, 47, 40],
'previous_vote': [45, 52, 38, 48, 55, 42, 47, 51, 44, 49, 40, 46, 50, 53, 56],
'urbanization': [75, 85, 45, 90, 95, 65, 80, 88, 50, 92, 40, 70, 82, 89, 87],
'outcome': [1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1]
}
df = pd.DataFrame(data)
# Prepare features and target
X = df.drop('outcome', axis=1)
y = df['outcome']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Create and train Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)
# Make predictions
y_pred = rf_model.predict(X_test_scaled)
y_pred_proba = rf_model.predict_proba(X_test_scaled)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print("Random Forest Results:")
print("=====================")
print(f"Accuracy: {accuracy:.2f}")
print(f"Number of trees: {rf_model.n_estimators}")
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)
# Feature importance
feature_importance = pd.DataFrame({
'feature': X.columns,
'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nFeature Importance:")
print(feature_importance)
# Hyperparameter tuning with GridSearchCV
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [None, 10, 20],
'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best accuracy: {grid_search.best_score_:.2f}")
Random Forest is an ensemble learning method that constructs multiple decision trees:
\[ \hat{y} = \text{mode}\{T_1(x), T_2(x), \ldots, T_B(x)\} \]
Where \( T_b(x) \) is the prediction of the b-th tree, \( B \) is the number of trees, and the mode is a majority vote across the trees (for regression, predictions are averaged instead).
Random Forest uses bagging to reduce variance: each tree is trained on a bootstrap sample of the data (drawn with replacement), and the trees' predictions are then aggregated.
This helps reduce overfitting and improves generalization.
At each split in each tree, Random Forest considers only a random subset of features:
\[ m = \sqrt{p} \]
Where p is the total number of features and m is the number of features considered at each split.
This decorrelates the trees and improves model performance.
Random Forest is useful for capturing nonlinear relationships and interactions among demographic features, and its feature importance scores help identify which factors drive the predictions.
# SVM for Election Outcome Prediction
import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
# Sample data: demographic features and election outcome (1 = win, 0 = lose)
data = {
'income': [35, 42, 28, 55, 62, 38, 45, 51, 33, 48, 29, 36, 41, 53, 59],
'education': [12, 16, 10, 18, 20, 14, 15, 17, 11, 19, 10, 13, 16, 18, 20],
'age': [42, 35, 51, 45, 39, 48, 42, 36, 54, 41, 49, 38, 43, 47, 40],
'previous_vote': [45, 52, 38, 48, 55, 42, 47, 51, 44, 49, 40, 46, 50, 53, 56],
'urbanization': [75, 85, 45, 90, 95, 65, 80, 88, 50, 92, 40, 70, 82, 89, 87],
'outcome': [1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1]
}
df = pd.DataFrame(data)
# Prepare features and target
X = df.drop('outcome', axis=1)
y = df['outcome']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Standardize features (important for SVM)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Create and train SVM model
svm_model = SVC(kernel='rbf', probability=True, random_state=42)
svm_model.fit(X_train_scaled, y_train)
# Make predictions
y_pred = svm_model.predict(X_test_scaled)
y_pred_proba = svm_model.predict_proba(X_test_scaled)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print("SVM Results:")
print("============")
print(f"Accuracy: {accuracy:.2f}")
print(f"Kernel: {svm_model.kernel}")
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)
# Hyperparameter tuning with GridSearchCV
param_grid = {
'C': [0.1, 1, 10, 100],
'gamma': [1, 0.1, 0.01, 0.001],
'kernel': ['rbf', 'linear']
}
grid_search = GridSearchCV(SVC(probability=True, random_state=42), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best accuracy: {grid_search.best_score_:.2f}")
# Make predictions with best model
best_svm = grid_search.best_estimator_
y_pred_best = best_svm.predict(X_test_scaled)
accuracy_best = accuracy_score(y_test, y_pred_best)
print(f"Best model accuracy: {accuracy_best:.2f}")
Support Vector Machines find the optimal hyperplane that maximizes the margin between classes:
\[ \min_{w,b} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i \]
Subject to:
\[ y_i(w \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \]
Where \( w \) is the weight vector defining the hyperplane, \( b \) is the bias term, \( \xi_i \) are slack variables allowing some misclassification, and \( C \) controls the trade-off between a wide margin and classification errors.
SVMs can handle non-linearly separable data using kernel functions:
\[ K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) \]
Common kernel functions include the linear kernel \( K(x_i, x_j) = x_i \cdot x_j \), the polynomial kernel \( K(x_i, x_j) = (x_i \cdot x_j + c)^d \), and the RBF (Gaussian) kernel \( K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2) \).
Support vectors are the data points that lie closest to the decision boundary:
\[ y_i(w \cdot x_i + b) = 1 \]
These points determine the position and orientation of the hyperplane.
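As a quick check of this idea, the short sketch below reuses the svm_model fitted in the example above and inspects the support vectors that scikit-learn exposes after training.
# Inspecting the support vectors of the fitted SVM (continues the example above)
print(f"Number of support vectors per class: {svm_model.n_support_}")
print(f"Indices of support vectors in the training set: {svm_model.support_}")
print("Support vectors (standardized feature space):")
print(svm_model.support_vectors_)
# Only these points constrain the position and orientation of the separating hyperplane;
# removing any non-support-vector training point would leave the decision boundary unchanged.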
SVMs are useful for:
# Gradient Boosting for Election Outcome Prediction
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
# Sample data: demographic features and election outcome (1 = win, 0 = lose)
data = {
'income': [35, 42, 28, 55, 62, 38, 45, 51, 33, 48, 29, 36, 41, 53, 59],
'education': [12, 16, 10, 18, 20, 14, 15, 17, 11, 19, 10, 13, 16, 18, 20],
'age': [42, 35, 51, 45, 39, 48, 42, 36, 54, 41, 49, 38, 43, 47, 40],
'previous_vote': [45, 52, 38, 48, 55, 42, 47, 51, 44, 49, 40, 46, 50, 53, 56],
'urbanization': [75, 85, 45, 90, 95, 65, 80, 88, 50, 92, 40, 70, 82, 89, 87],
'outcome': [1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1]
}
df = pd.DataFrame(data)
# Prepare features and target
X = df.drop('outcome', axis=1)
y = df['outcome']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Create and train Gradient Boosting model
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb_model.fit(X_train_scaled, y_train)
# Make predictions
y_pred = gb_model.predict(X_test_scaled)
y_pred_proba = gb_model.predict_proba(X_test_scaled)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print("Gradient Boosting Results:")
print("=========================")
print(f"Accuracy: {accuracy:.2f}")
print(f"Number of estimators: {gb_model.n_estimators}")
print(f"Learning rate: {gb_model.learning_rate}")
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)
# Feature importance
feature_importance = pd.DataFrame({
'feature': X.columns,
'importance': gb_model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nFeature Importance:")
print(feature_importance)
# Hyperparameter tuning with GridSearchCV
param_grid = {
'n_estimators': [50, 100, 200],
'learning_rate': [0.01, 0.1, 0.2],
'max_depth': [3, 5, 7]
}
grid_search = GridSearchCV(GradientBoostingClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best accuracy: {grid_search.best_score_:.2f}")
# Make predictions with best model
best_gb = grid_search.best_estimator_
y_pred_best = best_gb.predict(X_test_scaled)
accuracy_best = accuracy_score(y_test, y_pred_best)
print(f"Best model accuracy: {accuracy_best:.2f}")
Gradient Boosting builds an ensemble of weak learners (typically decision trees) sequentially:
\[ F_m(x) = F_{m-1}(x) + \gamma_m h_m(x) \]
Where \( F_m(x) \) is the ensemble after m iterations, \( h_m(x) \) is the weak learner added at step m, and \( \gamma_m \) is its weight (step size) chosen to minimize the loss.
Gradient Boosting minimizes the loss function by moving in the direction of the negative gradient:
\[ r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x)=F_{m-1}(x)} \]
Where \( r_{im} \) are the pseudo-residuals that the next weak learner tries to fit.
The learning rate \( \nu \) controls the contribution of each weak learner:
\[ F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_m h_m(x) \]
A smaller learning rate requires more iterations but can lead to better generalization.
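The additive update above can be made concrete with a minimal from-scratch sketch for a regression target: each shallow tree is fit to the current residuals and its prediction is added with a learning rate \( \nu \). The data here is synthetic and only meant to show the mechanics.
# Minimal from-scratch gradient boosting for squared-error loss (synthetic data)
import numpy as np
from sklearn.tree import DecisionTreeRegressor
rng = np.random.RandomState(42)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy).ravel() + rng.normal(0, 0.2, 200)
nu = 0.1            # learning rate
n_rounds = 100      # number of boosting iterations
F = np.full_like(y_toy, y_toy.mean())   # F_0: constant initial model
trees = []
for m in range(n_rounds):
    residuals = y_toy - F                        # pseudo-residuals for squared-error loss
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X_toy, residuals)                   # h_m fits the residuals
    F += nu * tree.predict(X_toy)                # F_m = F_{m-1} + nu * h_m
    trees.append(tree)
print(f"Training MSE after {n_rounds} rounds: {np.mean((y_toy - F)**2):.4f}")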
Gradient Boosting is useful for:
Clustering algorithms group similar voters or constituencies based on their characteristics without prior labels.
# K-Means Clustering for Voter Segmentation
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
# Sample data: voter characteristics
data = {
'income': [35, 42, 28, 55, 62, 38, 45, 51, 33, 48, 29, 36, 41, 53, 59, 25, 31, 47, 58, 65],
'education': [12, 16, 10, 18, 20, 14, 15, 17, 11, 19, 10, 13, 16, 18, 20, 9, 12, 16, 19, 21],
'age': [42, 35, 51, 45, 39, 48, 42, 36, 54, 41, 49, 38, 43, 47, 40, 55, 50, 44, 37, 33],
'urbanization': [75, 85, 45, 90, 95, 65, 80, 88, 50, 92, 40, 70, 82, 89, 87, 30, 55, 78, 93, 96]
}
df = pd.DataFrame(data)
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
# Determine optimal number of clusters using elbow method
inertia = []
silhouette_scores = []
k_range = range(2, 8)
for k in k_range:
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(X_scaled)
inertia.append(kmeans.inertia_)
silhouette_scores.append(silhouette_score(X_scaled, kmeans.labels_))
# Plot elbow method
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(k_range, inertia, 'bo-')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.subplot(1, 2, 2)
plt.plot(k_range, silhouette_scores, 'ro-')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score')
plt.tight_layout()
plt.show()
# Fit K-Means with optimal k
optimal_k = 3 # Based on elbow method and silhouette score
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
kmeans.fit(X_scaled)
# Add cluster labels to dataframe
df['cluster'] = kmeans.labels_
# Analyze cluster characteristics
cluster_summary = df.groupby('cluster').mean()
print("Cluster Summary:")
print(cluster_summary)
# Visualize clusters (first two dimensions)
plt.figure(figsize=(10, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=kmeans.labels_, cmap='viridis', alpha=0.7)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red', marker='X')
plt.xlabel('Income (standardized)')
plt.ylabel('Education (standardized)')
plt.title('K-Means Clustering of Voters')
plt.colorbar(label='Cluster')
plt.show()
K-Means clustering aims to partition n observations into k clusters:
\[ \min_{C} \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^2 \]
Where \( C_i \) is the set of points assigned to cluster i, \( \mu_i \) is the centroid (mean) of cluster i, and k is the number of clusters.
The algorithm typically uses Euclidean distance:
\[ d(x, \mu) = \sqrt{\sum_{j=1}^{p} (x_j - \mu_j)^2} \]
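For intuition, a single Lloyd iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points) can be written directly in NumPy; this is a minimal sketch, not a replacement for sklearn's KMeans.
# One Lloyd iteration of K-Means written directly in NumPy (illustrative)
import numpy as np
rng = np.random.RandomState(42)
X_pts = rng.randn(30, 2)                              # 30 points, 2 standardized features
centroids = X_pts[rng.choice(30, 3, replace=False)]   # 3 initial centroids chosen from the data
# Assignment step: nearest centroid by squared Euclidean distance
distances = ((X_pts[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
labels = distances.argmin(axis=1)
# Update step: move each centroid to the mean of its assigned points
new_centroids = np.array([X_pts[labels == k].mean(axis=0) for k in range(3)])
# Within-cluster sum of squares (inertia) measured after the update step
inertia = sum(((X_pts[labels == k] - new_centroids[k]) ** 2).sum() for k in range(3))
print("Inertia after one iteration:", round(float(inertia), 3))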
We can use several methods to determine the optimal k, such as the elbow method (plotting inertia against k) and the silhouette score, both of which are computed in the code above.
K-Means clustering is useful for:
For example, we might discover clusters like: "Urban educated professionals", "Rural agricultural workers", or "Suburban middle-class families".
# Hierarchical Clustering for Voter Segmentation
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
# Sample data: voter characteristics
data = {
'income': [35, 42, 28, 55, 62, 38, 45, 51, 33, 48, 29, 36, 41, 53, 59, 25, 31, 47, 58, 65],
'education': [12, 16, 10, 18, 20, 14, 15, 17, 11, 19, 10, 13, 16, 18, 20, 9, 12, 16, 19, 21],
'age': [42, 35, 51, 45, 39, 48, 42, 36, 54, 41, 49, 38, 43, 47, 40, 55, 50, 44, 37, 33],
'urbanization': [75, 85, 45, 90, 95, 65, 80, 88, 50, 92, 40, 70, 82, 89, 87, 30, 55, 78, 93, 96]
}
df = pd.DataFrame(data)
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
# Perform hierarchical clustering
linked = linkage(X_scaled, 'ward')
# Plot dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked,
orientation='top',
distance_sort='descending',
show_leaf_counts=True)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample index')
plt.ylabel('Distance')
plt.show()
# Fit Agglomerative Clustering with optimal number of clusters
optimal_clusters = 3
agg_clustering = AgglomerativeClustering(n_clusters=optimal_clusters, linkage='ward')  # Ward linkage implies Euclidean distances
cluster_labels = agg_clustering.fit_predict(X_scaled)
# Add cluster labels to dataframe
df['cluster'] = cluster_labels
# Analyze cluster characteristics
cluster_summary = df.groupby('cluster').mean()
print("Cluster Summary:")
print(cluster_summary)
# Visualize clusters (first two dimensions)
plt.figure(figsize=(10, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=cluster_labels, cmap='viridis', alpha=0.7)
plt.xlabel('Income (standardized)')
plt.ylabel('Education (standardized)')
plt.title('Hierarchical Clustering of Voters')
plt.colorbar(label='Cluster')
plt.show()
Hierarchical clustering builds a hierarchy of clusters either agglomeratively (bottom-up, repeatedly merging the two closest clusters) or divisively (top-down, repeatedly splitting clusters).
The distance between clusters can be measured with different linkage criteria, such as single, complete, average, and Ward linkage; the Ward criterion used above merges the pair of clusters that least increases the total within-cluster variance.
A dendrogram is a tree-like diagram that records the sequence of merges or splits; cutting it at a chosen height yields a flat clustering, as in the sketch below.
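A minimal sketch of cutting the dendrogram computed above into flat clusters using scipy's fcluster:
# Cutting the dendrogram into flat clusters (continues the example above)
from scipy.cluster.hierarchy import fcluster
# Cut so that at most 3 clusters remain (equivalently, cut at the corresponding height)
flat_labels = fcluster(linked, t=3, criterion='maxclust')
print("Flat cluster labels from the dendrogram cut:", flat_labels)
# These labels should broadly agree with the AgglomerativeClustering result,
# since both use Ward linkage on the same standardized data.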
Hierarchical clustering is useful for:
# DBSCAN Clustering for Voter Segmentation
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Sample data: voter characteristics
data = {
'income': [35, 42, 28, 55, 62, 38, 45, 51, 33, 48, 29, 36, 41, 53, 59, 25, 31, 47, 58, 65],
'education': [12, 16, 10, 18, 20, 14, 15, 17, 11, 19, 10, 13, 16, 18, 20, 9, 12, 16, 19, 21],
'age': [42, 35, 51, 45, 39, 48, 42, 36, 54, 41, 49, 38, 43, 47, 40, 55, 50, 44, 37, 33],
'urbanization': [75, 85, 45, 90, 95, 65, 80, 88, 50, 92, 40, 70, 82, 89, 87, 30, 55, 78, 93, 96]
}
df = pd.DataFrame(data)
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
# Perform DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
cluster_labels = dbscan.fit_predict(X_scaled)
# Add cluster labels to dataframe
df['cluster'] = cluster_labels
# Count number of clusters (excluding noise)
n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
n_noise = list(cluster_labels).count(-1)
print(f"Estimated number of clusters: {n_clusters}")
print(f"Estimated number of noise points: {n_noise}")
# Analyze cluster characteristics
cluster_summary = df.groupby('cluster').mean()
print("\nCluster Summary:")
print(cluster_summary)
# Visualize clusters (first two dimensions)
plt.figure(figsize=(10, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=cluster_labels, cmap='viridis', alpha=0.7)
plt.xlabel('Income (standardized)')
plt.ylabel('Education (standardized)')
plt.title('DBSCAN Clustering of Voters')
plt.colorbar(label='Cluster')
plt.show()
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that are closely packed: a point with at least min_samples neighbours within radius eps is a core point, points reachable from core points form clusters, and points reachable from no core point are labelled noise (-1).
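A common way to choose eps is the k-distance plot: sort each point's distance to its k-th nearest neighbour and look for the "knee". A minimal sketch on the standardized voter data above, with k matching the min_samples value used in the example:
# k-distance plot to guide the choice of eps (continues the example above)
import numpy as np
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
k = 5  # matches min_samples used above
nbrs = NearestNeighbors(n_neighbors=k).fit(X_scaled)
distances, _ = nbrs.kneighbors(X_scaled)
# distances[:, -1] is the distance to the k-th nearest neighbour
# (the point itself counts as the first, matching DBSCAN's min_samples convention)
k_distances = np.sort(distances[:, -1])
plt.figure(figsize=(8, 4))
plt.plot(k_distances)
plt.xlabel('Points sorted by distance')
plt.ylabel(f'Distance to {k}-th nearest neighbour')
plt.title('k-distance Plot for Choosing eps')
plt.show()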
DBSCAN is useful for:
# Gaussian Mixture Model for Voter Segmentation
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Sample data: voter characteristics
data = {
'income': [35, 42, 28, 55, 62, 38, 45, 51, 33, 48, 29, 36, 41, 53, 59, 25, 31, 47, 58, 65],
'education': [12, 16, 10, 18, 20, 14, 15, 17, 11, 19, 10, 13, 16, 18, 20, 9, 12, 16, 19, 21],
'age': [42, 35, 51, 45, 39, 48, 42, 36, 54, 41, 49, 38, 43, 47, 40, 55, 50, 44, 37, 33],
'urbanization': [75, 85, 45, 90, 95, 65, 80, 88, 50, 92, 40, 70, 82, 89, 87, 30, 55, 78, 93, 96]
}
df = pd.DataFrame(data)
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
# Determine optimal number of components using BIC
bic_scores = []
n_components_range = range(1, 8)
for n_components in n_components_range:
gmm = GaussianMixture(n_components=n_components, random_state=42)
gmm.fit(X_scaled)
bic_scores.append(gmm.bic(X_scaled))
# Plot BIC scores
plt.figure(figsize=(10, 6))
plt.plot(n_components_range, bic_scores, 'bo-')
plt.xlabel('Number of components')
plt.ylabel('BIC score')
plt.title('BIC Scores for Different Numbers of Components')
plt.show()
# Fit GMM with optimal number of components
optimal_components = 3
gmm = GaussianMixture(n_components=optimal_components, random_state=42)
gmm.fit(X_scaled)
# Predict cluster labels
cluster_labels = gmm.predict(X_scaled)
# Add cluster labels to dataframe
df['cluster'] = cluster_labels
# Analyze cluster characteristics
cluster_summary = df.groupby('cluster').mean()
print("Cluster Summary:")
print(cluster_summary)
# Get probabilities for each point
probs = gmm.predict_proba(X_scaled)
print(f"\nProbability shape: {probs.shape}")
# Visualize clusters (first two dimensions)
plt.figure(figsize=(10, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=cluster_labels, cmap='viridis', alpha=0.7)
plt.xlabel('Income (standardized)')
plt.ylabel('Education (standardized)')
plt.title('GMM Clustering of Voters')
plt.colorbar(label='Cluster')
plt.show()
A GMM assumes that the data is generated from a mixture of several Gaussian distributions:
\[ p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x | \mu_k, \Sigma_k) \]
Where \( \pi_k \) are the mixing coefficients (non-negative and summing to 1), \( \mathcal{N}(x | \mu_k, \Sigma_k) \) is the k-th Gaussian component with mean \( \mu_k \) and covariance \( \Sigma_k \), and K is the number of components.
GMM parameters are estimated using the EM algorithm: the E-step computes the responsibility of each component for each data point, and the M-step re-estimates the mixing coefficients, means, and covariances from those responsibilities.
The optimal number of components can be determined using information criteria such as:
\[ \text{AIC} = 2k - 2\ln(L) \]
\[ \text{BIC} = k\ln(n) - 2\ln(L) \]
Where L is the likelihood, k is the number of parameters, and n is the number of samples.
GMM is useful for:
Neural networks can model complex nonlinear relationships between demographic factors and election outcomes.
# Basic Neural Network for Election Prediction
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Sample data
data = {
'income': [35, 42, 28, 55, 62, 38, 45, 51, 33, 48, 29, 36, 41, 53, 59],
'education': [12, 16, 10, 18, 20, 14, 15, 17, 11, 19, 10, 13, 16, 18, 20],
'age': [42, 35, 51, 45, 39, 48, 42, 36, 54, 41, 49, 38, 43, 47, 40],
'urbanization': [75, 85, 45, 90, 95, 65, 80, 88, 50, 92, 40, 70, 82, 89, 87],
'vote_share': [48, 55, 42, 52, 58, 45, 50, 54, 46, 53, 43, 49, 52, 56, 59]
}
df = pd.DataFrame(data)
# Prepare features and target
X = df.drop('vote_share', axis=1).values
y = df['vote_share'].values
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Neural network parameters
input_size = X_train.shape[1]
hidden_size = 5
output_size = 1
learning_rate = 0.01
epochs = 1000
# Initialize weights and biases
np.random.seed(42)
W1 = np.random.randn(input_size, hidden_size)
b1 = np.zeros((1, hidden_size))
W2 = np.random.randn(hidden_size, output_size)
b2 = np.zeros((1, output_size))
# Sigmoid activation function
def sigmoid(x):
return 1 / (1 + np.exp(-x))
# Training
loss_history = []
for epoch in range(epochs):
# Forward pass
z1 = np.dot(X_train_scaled, W1) + b1
a1 = sigmoid(z1)
z2 = np.dot(a1, W2) + b2
y_pred = z2 # Linear activation for output (regression)
# Calculate loss (MSE)
loss = np.mean((y_pred - y_train.reshape(-1, 1))**2)
loss_history.append(loss)
# Backward pass
dy_pred = 2 * (y_pred - y_train.reshape(-1, 1)) / len(y_train)
dW2 = np.dot(a1.T, dy_pred)
db2 = np.sum(dy_pred, axis=0, keepdims=True)
da1 = np.dot(dy_pred, W2.T)
dz1 = da1 * a1 * (1 - a1)
dW1 = np.dot(X_train_scaled.T, dz1)
db1 = np.sum(dz1, axis=0, keepdims=True)
# Update weights and biases
W2 -= learning_rate * dW2
b2 -= learning_rate * db2
W1 -= learning_rate * dW1
b1 -= learning_rate * db1
print("Neural Network Training Results:")
print("===============================")
print(f"Final loss: {loss_history[-1]:.4f}")
# Plot training loss
plt.figure(figsize=(10, 6))
plt.plot(loss_history)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Neural Network Training Loss')
plt.show()
# Make predictions
z1_test = np.dot(X_test_scaled, W1) + b1
a1_test = sigmoid(z1_test)
z2_test = np.dot(a1_test, W2) + b2
y_pred_test = z2_test
print(f"Predictions: {y_pred_test.flatten()}")
print(f"Actual values: {y_test}")
A basic neural network consists of an input layer, one or more hidden layers, and an output layer of interconnected neurons.
Each neuron applies an activation function to the weighted sum of its inputs:
\[ z = w_1x_1 + w_2x_2 + \cdots + w_nx_n + b \]
\[ a = f(z) \]
Where f is the activation function (e.g., sigmoid, ReLU).
For a network with one hidden layer:
\[ z^{[1]} = W^{[1]} x + b^{[1]} \]
\[ a^{[1]} = f^{[1]}(z^{[1]}) \]
\[ z^{[2]} = W^{[2]} a^{[1]} + b^{[2]} \]
\[ \hat{y} = f^{[2]}(z^{[2]}) \]
For regression problems, we typically use mean squared error:
\[ J(W, b) = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})^2 \]
Backpropagation calculates gradients of the loss function with respect to the weights and biases using the chain rule:
\[ \frac{\partial J}{\partial W^{[2]}} = \frac{\partial J}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z^{[2]}} \frac{\partial z^{[2]}}{\partial W^{[2]}} \]
\[ \frac{\partial J}{\partial W^{[1]}} = \frac{\partial J}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z^{[2]}} \frac{\partial z^{[2]}}{\partial a^{[1]}} \frac{\partial a^{[1]}}{\partial z^{[1]}} \frac{\partial z^{[1]}}{\partial W^{[1]}} \]
# Detailed Backpropagation Implementation
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Sample data
data = {
'income': [35, 42, 28, 55, 62, 38, 45, 51, 33, 48],
'education': [12, 16, 10, 18, 20, 14, 15, 17, 11, 19],
'age': [42, 35, 51, 45, 39, 48, 42, 36, 54, 41],
'vote_share': [48, 55, 42, 52, 58, 45, 50, 54, 46, 53]
}
df = pd.DataFrame(data)
# Prepare features and target
X = df.drop('vote_share', axis=1).values
y = df['vote_share'].values
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Neural network parameters
input_size = X_train.shape[1]
hidden_size = 4
output_size = 1
learning_rate = 0.01
epochs = 2000
# Initialize weights and biases
np.random.seed(42)
W1 = np.random.randn(input_size, hidden_size) * 0.1
b1 = np.zeros((1, hidden_size))
W2 = np.random.randn(hidden_size, output_size) * 0.1
b2 = np.zeros((1, output_size))
# ReLU activation function
def relu(x):
return np.maximum(0, x)
# Derivative of ReLU
def relu_derivative(x):
return (x > 0).astype(float)
# Training with detailed backpropagation
loss_history = []
for epoch in range(epochs):
# Forward pass
z1 = np.dot(X_train_scaled, W1) + b1
a1 = relu(z1)
z2 = np.dot(a1, W2) + b2
y_pred = z2 # Linear activation for output
# Calculate loss (MSE)
loss = np.mean((y_pred - y_train.reshape(-1, 1))**2)
loss_history.append(loss)
# Backward pass - detailed step by step
m = len(y_train)
# Output layer gradients
dy_pred = 2 * (y_pred - y_train.reshape(-1, 1)) / m # dJ/dy_pred
dz2 = dy_pred # dJ/dz2 = dJ/dy_pred * dy_pred/dz2 (linear activation derivative is 1)
dW2 = np.dot(a1.T, dz2) # dJ/dW2 = dJ/dz2 * dz2/dW2
db2 = np.sum(dz2, axis=0, keepdims=True) # dJ/db2 = dJ/dz2 * dz2/db2
# Hidden layer gradients
da1 = np.dot(dz2, W2.T) # dJ/da1 = dJ/dz2 * dz2/da1
dz1 = da1 * relu_derivative(z1) # dJ/dz1 = dJ/da1 * da1/dz1
dW1 = np.dot(X_train_scaled.T, dz1) # dJ/dW1 = dJ/dz1 * dz1/dW1
db1 = np.sum(dz1, axis=0, keepdims=True) # dJ/db1 = dJ/dz1 * dz1/db1
# Update weights and biases
W2 -= learning_rate * dW2
b2 -= learning_rate * db2
W1 -= learning_rate * dW1
b1 -= learning_rate * db1
# Print progress
if epoch % 500 == 0:
print(f"Epoch {epoch}, Loss: {loss:.4f}")
print(f"Final loss: {loss_history[-1]:.4f}")
# Plot training loss
plt.figure(figsize=(10, 6))
plt.plot(loss_history)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Neural Network Training Loss (Backpropagation)')
plt.show()
# Make predictions
z1_test = np.dot(X_test_scaled, W1) + b1
a1_test = relu(z1_test)
z2_test = np.dot(a1_test, W2) + b2
y_pred_test = z2_test
print(f"Predictions: {y_pred_test.flatten()}")
print(f"Actual values: {y_test}")
Backpropagation is the algorithm used to train neural networks by efficiently calculating gradients: a forward pass computes the predictions and the loss, a backward pass propagates the error from the output layer back to the input layer using the chain rule, and the weights are then updated by gradient descent.
The chain rule is used to compute gradients layer by layer:
\[ \frac{\partial J}{\partial W^{[l]}} = \frac{\partial J}{\partial z^{[l]}} \frac{\partial z^{[l]}}{\partial W^{[l]}} \]
\[ \frac{\partial J}{\partial z^{[l]}} = \frac{\partial J}{\partial a^{[l]}} \frac{\partial a^{[l]}}{\partial z^{[l]}} \]
\[ \frac{\partial J}{\partial a^{[l-1]}} = \frac{\partial J}{\partial z^{[l]}} \frac{\partial z^{[l]}}{\partial a^{[l-1]}} \]
For a network with L layers:
\[ \delta^{[L]} = \frac{\partial J}{\partial a^{[L]}} \frac{\partial a^{[L]}}{\partial z^{[L]}} \]
\[ \delta^{[l]} = (\delta^{[l+1]} (W^{[l+1]})^T) \odot \frac{\partial a^{[l]}}{\partial z^{[l]}} \]
\[ \frac{\partial J}{\partial W^{[l]}} = \delta^{[l]} (a^{[l-1]})^T \]
\[ \frac{\partial J}{\partial b^{[l]}} = \delta^{[l]} \]
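A standard way to verify a backpropagation implementation is gradient checking: compare the analytic gradient with a centred finite-difference estimate on a tiny network. The sketch below is self-contained and uses random data purely for the check.
# Gradient checking: analytic backprop gradient vs numerical finite differences
import numpy as np
rng = np.random.RandomState(0)
X_chk = rng.randn(6, 3)                 # 6 samples, 3 features
y_chk = rng.randn(6, 1)
W1 = rng.randn(3, 4) * 0.1
b1 = np.zeros((1, 4))
W2 = rng.randn(4, 1) * 0.1
b2 = np.zeros((1, 1))
def forward_loss(W1, b1, W2, b2):
    a1 = np.maximum(0, X_chk @ W1 + b1)          # ReLU hidden layer
    y_hat = a1 @ W2 + b2                         # linear output (regression)
    return np.mean((y_hat - y_chk) ** 2), a1, y_hat
# Analytic gradient of the loss with respect to W1 (same backprop steps as above)
loss, a1, y_hat = forward_loss(W1, b1, W2, b2)
dz2 = 2 * (y_hat - y_chk) / len(y_chk)
dz1 = (dz2 @ W2.T) * (X_chk @ W1 + b1 > 0)       # ReLU derivative
dW1_analytic = X_chk.T @ dz1
# Numerical gradient for one weight entry via centred differences
eps = 1e-6
i, j = 1, 2
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[i, j] += eps
W1_minus[i, j] -= eps
num_grad = (forward_loss(W1_plus, b1, W2, b2)[0] - forward_loss(W1_minus, b1, W2, b2)[0]) / (2 * eps)
print(f"Analytic dJ/dW1[{i},{j}]: {dW1_analytic[i, j]:.8f}")
print(f"Numerical estimate     : {num_grad:.8f}")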
Common activation function derivatives: for the sigmoid, \( \sigma'(z) = \sigma(z)(1 - \sigma(z)) \); for tanh, \( 1 - \tanh^2(z) \); for ReLU, 1 when z > 0 and 0 otherwise (these derivatives are implemented in the comparison code below).
# Comparison of Activation Functions
import numpy as np
import matplotlib.pyplot as plt
# Define activation functions
def sigmoid(x):
return 1 / (1 + np.exp(-x))
def relu(x):
return np.maximum(0, x)
def tanh(x):
return np.tanh(x)
def leaky_relu(x, alpha=0.01):
return np.where(x > 0, x, alpha * x)
def softplus(x):
return np.log(1 + np.exp(x))
# Create input values
x = np.linspace(-5, 5, 100)
# Calculate activation values
y_sigmoid = sigmoid(x)
y_relu = relu(x)
y_tanh = tanh(x)
y_leaky_relu = leaky_relu(x)
y_softplus = softplus(x)
# Plot activation functions
plt.figure(figsize=(12, 8))
plt.subplot(2, 3, 1)
plt.plot(x, y_sigmoid)
plt.title('Sigmoid')
plt.grid(True)
plt.subplot(2, 3, 2)
plt.plot(x, y_relu)
plt.title('ReLU')
plt.grid(True)
plt.subplot(2, 3, 3)
plt.plot(x, y_tanh)
plt.title('Tanh')
plt.grid(True)
plt.subplot(2, 3, 4)
plt.plot(x, y_leaky_relu)
plt.title('Leaky ReLU')
plt.grid(True)
plt.subplot(2, 3, 5)
plt.plot(x, y_softplus)
plt.title('Softplus')
plt.grid(True)
plt.tight_layout()
plt.show()
# Compare performance with different activation functions
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Sample data
data = {
'income': [35, 42, 28, 55, 62, 38, 45, 51, 33, 48, 29, 36, 41, 53, 59],
'education': [12, 16, 10, 18, 20, 14, 15, 17, 11, 19, 10, 13, 16, 18, 20],
'age': [42, 35, 51, 45, 39, 48, 42, 36, 54, 41, 49, 38, 43, 47, 40],
'urbanization': [75, 85, 45, 90, 95, 65, 80, 88, 50, 92, 40, 70, 82, 89, 87],
'vote_share': [48, 55, 42, 52, 58, 45, 50, 54, 46, 53, 43, 49, 52, 56, 59]
}
df = pd.DataFrame(data)
# Prepare features and target
X = df.drop('vote_share', axis=1).values
y = df['vote_share'].values
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train neural networks with different activation functions
def train_nn(activation_fn, activation_derivative, epochs=1000, lr=0.01):
np.random.seed(42)
W1 = np.random.randn(X_train.shape[1], 5) * 0.1
b1 = np.zeros((1, 5))
W2 = np.random.randn(5, 1) * 0.1
b2 = np.zeros((1, 1))
loss_history = []
for epoch in range(epochs):
# Forward pass
z1 = np.dot(X_train_scaled, W1) + b1
a1 = activation_fn(z1)
z2 = np.dot(a1, W2) + b2
y_pred = z2
# Calculate loss
loss = np.mean((y_pred - y_train.reshape(-1, 1))**2)
loss_history.append(loss)
# Backward pass
dy_pred = 2 * (y_pred - y_train.reshape(-1, 1)) / len(y_train)
dW2 = np.dot(a1.T, dy_pred)
db2 = np.sum(dy_pred, axis=0, keepdims=True)
da1 = np.dot(dy_pred, W2.T)
dz1 = da1 * activation_derivative(z1)
dW1 = np.dot(X_train_scaled.T, dz1)
db1 = np.sum(dz1, axis=0, keepdims=True)
# Update weights
W2 -= lr * dW2
b2 -= lr * db2
W1 -= lr * dW1
b1 -= lr * db1
return loss_history
# Define activation functions and their derivatives
def sigmoid_derivative(x):
return sigmoid(x) * (1 - sigmoid(x))
def relu_derivative(x):
return (x > 0).astype(float)
def tanh_derivative(x):
return 1 - np.tanh(x)**2
def leaky_relu_derivative(x, alpha=0.01):
return np.where(x > 0, 1, alpha)
def softplus_derivative(x):
return sigmoid(x)
# Train with different activation functions
activations = {
'Sigmoid': (sigmoid, sigmoid_derivative),
'ReLU': (relu, relu_derivative),
'Tanh': (tanh, tanh_derivative),
'Leaky ReLU': (lambda x: leaky_relu(x, 0.01), lambda x: leaky_relu_derivative(x, 0.01)),
'Softplus': (softplus, softplus_derivative)
}
results = {}
for name, (act_fn, act_derivative) in activations.items():
loss_history = train_nn(act_fn, act_derivative, epochs=1000)
results[name] = loss_history
print(f"{name}: Final loss = {loss_history[-1]:.4f}")
# Plot comparison
plt.figure(figsize=(10, 6))
for name, loss_history in results.items():
plt.plot(loss_history, label=name)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Comparison of Activation Functions')
plt.legend()
plt.grid(True)
plt.show()
Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns:
| Function | Range | Advantages | Disadvantages |
|---|---|---|---|
| Sigmoid | (0, 1) | Smooth gradient, output interpretation | Vanishing gradient, not zero-centered |
| Tanh | (-1, 1) | Zero-centered, stronger gradient | Vanishing gradient |
| ReLU | [0, ∞) | Computationally efficient, avoids vanishing gradient | Dying ReLU problem, not zero-centered |
| Leaky ReLU | (-∞, ∞) | Prevents dying ReLU, computational efficiency | Results not consistent |
| Softplus | (0, ∞) | Smooth approximation of ReLU | Computationally expensive |
Guidelines for selecting activation functions:
For election prediction:
Deep learning models can capture complex patterns in election data using multiple layers of abstraction.
# CNN for Regional Election Patterns (Conceptual Example)
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.optimizers import Adam
# This is a conceptual example - in practice, you would need regional data formatted as images
# For example, each region could be represented as a grid of demographic and voting data
# Generate sample data (simulated regional data)
num_regions = 1000
height, width, channels = 32, 32, 3 # Simulating image-like data
# Simulated input: regional data as "images"
X = np.random.rand(num_regions, height, width, channels)
# Simulated output: vote share for each region
y = np.random.rand(num_regions) * 100 # Vote share between 0-100
# Split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Build CNN model
model = Sequential([
Conv2D(32, (3, 3), activation='relu', input_shape=(height, width, channels)),
MaxPooling2D((2, 2)),
Conv2D(64, (3, 3), activation='relu'),
MaxPooling2D((2, 2)),
Conv2D(64, (3, 3), activation='relu'),
Flatten(),
Dense(64, activation='relu'),
Dropout(0.5),
Dense(1) # Output layer for regression
])
# Compile model
model.compile(optimizer=Adam(learning_rate=0.001),
loss='mse',
metrics=['mae'])
# Display model architecture
model.summary()
# Train model
history = model.fit(X_train, y_train,
epochs=50,
batch_size=32,
validation_split=0.2,
verbose=1)
# Evaluate model
test_loss, test_mae = model.evaluate(X_test, y_test, verbose=0)
print(f"Test MSE: {test_loss:.4f}, Test MAE: {test_mae:.4f}")
# Make predictions
predictions = model.predict(X_test[:5])
print(f"Predictions: {predictions.flatten()}")
print(f"Actual values: {y_test[:5]}")
CNNs are designed to process grid-like data such as images. They use convolutional layers to detect spatial patterns:
\[ (f * g)(t) = \int_{-\infty}^{\infty} f(\tau) g(t - \tau) d\tau \]
In discrete form for 2D images:
\[ (I * K)(i, j) = \sum_{m} \sum_{n} I(i+m, j+n) K(m, n) \]
Where I is the input image and K is the kernel (filter).
A typical CNN consists of convolutional layers that detect local patterns, pooling layers that downsample the feature maps, and fully connected layers that combine the extracted features for the final prediction (the architecture used in the code above).
For election forecasting, CNNs can be applied to:
Each "pixel" in the input could represent demographic or voting data for a small geographic area.
# RNN for Election Time Series Forecasting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
# Generate sample time series data
np.random.seed(42)
time_steps = 100
n_features = 5
n_samples = 1000
# Create synthetic time series data
X = np.random.randn(n_samples, time_steps, n_features)
y = np.random.rand(n_samples) * 100 # Vote share between 0-100
# Split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Build RNN model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense, Dropout
from tensorflow.keras.optimizers import Adam
model = Sequential([
SimpleRNN(50, activation='relu', input_shape=(time_steps, n_features), return_sequences=True),
Dropout(0.2),
SimpleRNN(50, activation='relu'),
Dropout(0.2),
Dense(1)
])
# Compile model
model.compile(optimizer=Adam(learning_rate=0.001),
loss='mse',
metrics=['mae'])
# Display model architecture
model.summary()
# Train model
history = model.fit(X_train, y_train,
epochs=50,
batch_size=32,
validation_split=0.2,
verbose=1)
# Evaluate model
test_loss, test_mae = model.evaluate(X_test, y_test, verbose=0)
print(f"Test MSE: {test_loss:.4f}, Test MAE: {test_mae:.4f}")
# Make predictions
predictions = model.predict(X_test[:5])
print(f"Predictions: {predictions.flatten()}")
print(f"Actual values: {y_test[:5]}")
# Plot training history
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(history.history['mae'], label='Training MAE')
plt.plot(history.history['val_mae'], label='Validation MAE')
plt.title('Model MAE')
plt.xlabel('Epoch')
plt.ylabel('MAE')
plt.legend()
plt.tight_layout()
plt.show()
RNNs are designed to process sequential data by maintaining a hidden state that captures information about previous elements in the sequence:
\[ h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \]
\[ y_t = W_{hy} h_t + b_y \]
Where \( h_t \) is the hidden state at time t, \( x_t \) is the input at time t, \( y_t \) is the output, \( W_{xh} \), \( W_{hh} \), and \( W_{hy} \) are weight matrices, \( b_h \) and \( b_y \) are bias vectors, and f is an activation function.
Plain RNNs struggle with long sequences because gradients can vanish or explode as they are propagated back through many time steps; these challenges led to the development of more advanced architectures like LSTM and GRU, as sketched below.
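As a sketch of how the same pipeline would look with a gated architecture, the SimpleRNN layers above can be swapped for LSTM layers; the data shapes, compilation, and training loop are assumed unchanged from the example above.
# Swapping SimpleRNN for LSTM layers (same synthetic data and shapes as above)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam
lstm_model = Sequential([
    LSTM(50, activation='tanh', input_shape=(time_steps, n_features), return_sequences=True),
    Dropout(0.2),
    LSTM(50, activation='tanh'),
    Dropout(0.2),
    Dense(1)
])
lstm_model.compile(optimizer=Adam(learning_rate=0.001), loss='mse', metrics=['mae'])
# lstm_model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2)
# The gating mechanism lets the model retain information over longer spans of the
# campaign timeline than a plain SimpleRNN, mitigating vanishing gradients.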
RNNs are useful for:
# Transfer Learning for Election Prediction
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.optimizers import Adam
# Sample data
data = {
'income': [35, 42, 28, 55, 62, 38, 45, 51, 33, 48, 29, 36, 41, 53, 59],
'education': [12, 16, 10, 18, 20, 14, 15, 17, 11, 19, 10, 13, 16, 18, 20],
'age': [42, 35, 51, 45, 39, 48, 42, 36, 54, 41, 49, 38, 43, 47, 40],
'urbanization': [75, 85, 45, 90, 95, 65, 80, 88, 50, 92, 40, 70, 82, 89, 87],
'vote_share': [48, 55, 42, 52, 58, 45, 50, 54, 46, 53, 43, 49, 52, 56, 59]
}
df = pd.DataFrame(data)
# Prepare features and target
X = df.drop('vote_share', axis=1).values
y = df['vote_share'].values
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Step 1: Train a base model on a related task (e.g., predicting party affiliation)
# For demonstration, we'll create a base model architecture
# Base model input
base_input = Input(shape=(X_train.shape[1],))
# Base model layers
x = Dense(64, activation='relu')(base_input)
x = Dropout(0.3)(x)
x = Dense(32, activation='relu')(x)
base_output = Dense(16, activation='relu')(x)
# Create base model
base_model = Model(inputs=base_input, outputs=base_output, name='base_model')
# Compile and train base model (in practice, this would be trained on a larger dataset)
base_model.compile(optimizer=Adam(learning_rate=0.001), loss='mse')
# base_model.fit(X_base, y_base, epochs=100, verbose=0) # Would train on actual base data
print("Base model architecture:")
base_model.summary()
# Step 2: Transfer learning - use base model for election prediction
# Freeze base model layers (optional)
# base_model.trainable = False
# Create transfer model
transfer_input = Input(shape=(X_train.shape[1],))
x = base_model(transfer_input)
x = Dense(8, activation='relu')(x)
x = Dropout(0.2)(x)
transfer_output = Dense(1, activation='linear')(x) # Regression output
# Create transfer model
transfer_model = Model(inputs=transfer_input, outputs=transfer_output, name='transfer_model')
# Compile transfer model
transfer_model.compile(optimizer=Adam(learning_rate=0.0005), loss='mse', metrics=['mae'])
print("\nTransfer model architecture:")
transfer_model.summary()
# Train transfer model
history = transfer_model.fit(X_train_scaled, y_train,
epochs=200,
batch_size=8,
validation_split=0.2,
verbose=1)
# Evaluate transfer model
test_loss, test_mae = transfer_model.evaluate(X_test_scaled, y_test, verbose=0)
print(f"Transfer model Test MSE: {test_loss:.4f}, Test MAE: {test_mae:.4f}")
# Compare with model trained from scratch
# Create model from scratch
scratch_input = Input(shape=(X_train.shape[1],))
x = Dense(64, activation='relu')(scratch_input)
x = Dropout(0.3)(x)
x = Dense(32, activation='relu')(x)
x = Dropout(0.2)(x)
x = Dense(16, activation='relu')(x)
x = Dropout(0.1)(x)
scratch_output = Dense(1, activation='linear')(x)
scratch_model = Model(inputs=scratch_input, outputs=scratch_output)
scratch_model.compile(optimizer=Adam(learning_rate=0.001), loss='mse', metrics=['mae'])
# Train scratch model
scratch_history = scratch_model.fit(X_train_scaled, y_train,
epochs=200,
batch_size=8,
validation_split=0.2,
verbose=0)
# Evaluate scratch model
scratch_loss, scratch_mae = scratch_model.evaluate(X_test_scaled, y_test, verbose=0)
print(f"Scratch model Test MSE: {scratch_loss:.4f}, Test MAE: {scratch_mae:.4f}")
# Compare performance
print(f"\nPerformance comparison:")
print(f"Transfer learning MAE: {test_mae:.4f}")
print(f"Scratch model MAE: {scratch_mae:.4f}")
print(f"Improvement: {((scratch_mae - test_mae) / scratch_mae * 100):.2f}%")
Transfer learning leverages knowledge gained from solving one problem and applies it to a different but related problem:
\[ \theta_{\text{target}} = \theta_{\text{source}} + \Delta\theta \]
Where \( \theta_{\text{source}} \) are the parameters learned on the source task, \( \Delta\theta \) is the adjustment learned during fine-tuning, and \( \theta_{\text{target}} \) are the resulting parameters for the target task.
Transfer learning can be applied to election prediction by:
Generate actionable insights and recommendations to optimize campaign strategies using advanced optimization algorithms and explainable AI techniques.
Based on predictive models and historical data analysis, here are actionable recommendations for optimizing election campaign strategies.
We use linear programming to maximize expected seats subject to resource constraints:
Objective function: \[ \max \sum_{i=1}^{n} P_i(wins) \cdot S_i \]
Subject to: \[ \sum_{i=1}^{n} R_i \leq R_{total} \]
And: \[ R_i^{min} \leq R_i \leq R_i^{max} \quad \forall i \]
Where \( P_i(wins) \) is the probability of winning constituency i, \( S_i \) is the strategic importance, and \( R_i \) is resources allocated.
# Linear Programming for Campaign Strategy Optimization
from scipy.optimize import linprog
# Coefficients for objective function (negative for maximization)
c = [-0.85, -0.70, -0.60, -0.45] # -P_i(wins)
# Inequality constraints (resource allocation)
A = [[1, 1, 1, 1]] # Total resources
b = [100] # Total resource constraint
# Bounds for each variable
bounds = [(10, 40), (15, 35), (20, 30), (15, 25)]
# Solve the linear programming problem
result = linprog(c, A_ub=A, b_ub=b, bounds=bounds, method='highs')
print("Optimal resource allocation:", result.x)
print("Maximum expected seats:", -result.fun)
| Region | Priority Level | Recommended Approach | Expected Impact | Resource Allocation |
|---|---|---|---|---|
| North India | High | Focus on development agenda and nationalism | +5-7% vote swing | 35% of total resources |
| South India | Medium | Emphasize regional issues and alliances | +3-5% vote swing | 25% of total resources |
| East India | Low | Grassroots mobilization and welfare schemes | +2-3% vote swing | 20% of total resources |
| West India | Medium | Business-friendly policies and infrastructure | +4-6% vote swing | 20% of total resources |
Data-driven recommendations for allocating campaign resources using optimization algorithms to maximize electoral impact.
We use genetic algorithms to find near-optimal resource allocation across regions and campaign activities:
Fitness function: \[ \max \sum_{i=1}^{n} \sum_{j=1}^{m} E_{ij} \cdot R_{ij} \]
Subject to: \[ \sum_{j=1}^{m} R_{ij} \leq B_i \quad \forall i \]
And: \[ \sum_{i=1}^{n} \sum_{j=1}^{m} R_{ij} \leq R_{total} \]
Where \( E_{ij} \) is effectiveness of resource j in region i, \( R_{ij} \) is resources allocated, and \( B_i \) is regional budget cap.
# Genetic Algorithm for Resource Allocation
import numpy as np
from geneticalgorithm import geneticalgorithm as ga
# Effectiveness matrix (regions x activities)
effectiveness = np.array([
[0.9, 0.7, 0.8, 0.6], # North India
[0.7, 0.8, 0.9, 0.7], # South India
[0.6, 0.9, 0.7, 0.8], # East India
[0.8, 0.6, 0.7, 0.9] # West India
])
def fitness_function(X):
# Reshape the solution vector into a matrix
allocation = X.reshape((4, 4))
# Calculate total effectiveness
total_effectiveness = np.sum(effectiveness * allocation)
# Penalty for constraint violations
penalty = 0
regional_budgets = [40, 30, 20, 20] # Budget caps for each region
for i in range(4):
if np.sum(allocation[i]) > regional_budgets[i]:
penalty += 1000 * (np.sum(allocation[i]) - regional_budgets[i])
if np.sum(allocation) > 110: # Total budget constraint
penalty += 1000 * (np.sum(allocation) - 110)
return - (total_effectiveness - penalty) # Negative for minimization
# Set up genetic algorithm
varbounds = np.array([[0, 20]] * 16) # 16 variables (4 regions x 4 activities)
algorithm_param = {'max_num_iteration': 1000,
'population_size': 100,
'mutation_probability': 0.1,
'elit_ratio': 0.01,
'crossover_probability': 0.5,
'parents_portion': 0.3,
'crossover_type': 'uniform',
'max_iteration_without_improv': 300}
model = ga(function=fitness_function, dimension=16, variable_type='real', variable_boundaries=varbounds, algorithm_parameters=algorithm_param)
model.run()
# Get the optimal allocation
optimal_allocation = model.output_dict['variable'].reshape((4, 4))
print("Optimal resource allocation:\n", optimal_allocation)
print("Total effectiveness:", -model.output_dict['function'])
Data-driven recommendations for crafting and targeting campaign messages using natural language processing and reinforcement learning.
We use Q-learning to optimize message selection based on voter response:
Q-value update: \[ Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)] \]
Where \( Q(s,a) \) is the estimated value of taking action a (a message type) in state s (a voter segment), \( \alpha \) is the learning rate, \( \gamma \) is the discount factor, r is the observed reward, and \( s' \) is the next state.
# Reinforcement Learning for Message Optimization
import numpy as np
# Define states (voter segments) and actions (message types)
states = ['Youth', 'Middle-Aged', 'Senior', 'Elderly']
actions = ['Economic', 'Security', 'Welfare', 'Education']
# Initialize Q-table
Q = np.zeros((len(states), len(actions)))
# Hyperparameters
alpha = 0.1 # Learning rate
gamma = 0.9 # Discount factor
epsilon = 0.1 # Exploration rate
# Simulated training process
for episode in range(1000):
state = np.random.randint(0, len(states)) # Random initial state
for step in range(10): # 10 steps per episode
# Epsilon-greedy action selection
if np.random.random() < epsilon:
action = np.random.randint(0, len(actions)) # Explore
else:
action = np.argmax(Q[state]) # Exploit
# Simulate reward based on message effectiveness
effectiveness_matrix = np.array([
[0.8, 0.6, 0.7, 0.9], # Youth
[0.9, 0.7, 0.6, 0.8], # Middle-Aged
[0.7, 0.9, 0.8, 0.6], # Senior
[0.6, 0.8, 0.9, 0.7] # Elderly
])
reward = effectiveness_matrix[state, action] * 10
# Next state (simulate state transition)
next_state = np.random.randint(0, len(states))
# Update Q-value
Q[state, action] = Q[state, action] + alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
state = next_state
print("Optimized Q-table:")
for i, state in enumerate(states):
print(f"{state}: {Q[i]}")
| Message Theme | Youth (18-25) | Middle-Aged (26-45) | Senior (46-60) | Elderly (60+) | Overall Effectiveness |
|---|---|---|---|---|---|
| Economic Development | 68% | 82% | 75% | 63% | — |
| National Security | 55% | 73% | 88% | 92% | — |
| Social Welfare | 72% | 65% | 78% | 85% | — |
Precision targeting of voter segments using clustering algorithms and optimization techniques to maximize campaign efficiency.
We use K-means clustering to identify distinct voter segments based on demographic and behavioral characteristics:
Objective function: \[ \min \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^2 \]
Where \( C_i \) is the set of voters assigned to segment i, \( \mu_i \) is the centroid of segment i, and k is the number of segments.
# K-Means Clustering for Voter Segmentation
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Sample voter data
data = {
'age': [25, 35, 45, 55, 65, 28, 38, 48, 58, 68],
'income': [40, 60, 80, 40, 60, 45, 65, 85, 45, 65],
'education': [12, 16, 14, 10, 8, 13, 17, 15, 11, 9],
'previous_vote': [1, 1, 0, 0, 1, 1, 0, 0, 1, 1] # 1=voted for us, 0=did not
}
df = pd.DataFrame(data)
# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
# Apply K-means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(scaled_data)
# Add cluster labels to dataframe
df['cluster'] = clusters
# Analyze cluster characteristics
cluster_summary = df.groupby('cluster').mean()
print("Cluster characteristics:")
print(cluster_summary)
# Calculate cluster sizes
cluster_sizes = df['cluster'].value_counts()
print("\nCluster sizes:")
print(cluster_sizes)
| Voter Segment | Size (% of electorate) | Current Support | Swing Potential | Recommended Approach | Priority Level |
|---|---|---|---|---|---|
| Loyal Supporters | 32% | 95% | Low | Mobilization and turnout focus | Medium |
| Lean Supporters | 18% | 65% | Medium | Reinforcement messaging | High |
| True Undecided | 15% | N/A | High | Issue-based persuasion | Critical |
Using SHAP and LIME to interpret machine learning models and provide transparent, actionable recommendations for campaign strategy based on exit poll data.
SHAP values provide a game-theoretic approach to explain the output of any machine learning model. For exit poll analysis, SHAP helps us understand which factors most influence voting behavior and by how much.
The SHAP value for feature i is calculated as:
\[ \phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!(|N| - |S| - 1)!}{|N|!} [f(S \cup \{i\}) - f(S)] \]
Where \( \phi_i \) is the SHAP value (contribution) of feature i, N is the set of all features, S ranges over subsets of features not containing i, and \( f(S) \) is the model's prediction when only the features in S are known.
Consider a constituency with the following features:
To calculate the SHAP value for "Previous vote" (feature i):
For a specific subset S = {age, income}:
\[ \phi_{\text{prev\_vote}} += \frac{2!(5-2-1)!}{5!} [f(\{\text{age, income, prev\_vote}\}) - f(\{\text{age, income}\})] \]
\[ = \frac{2! \cdot 2!}{5!} [0.62 - 0.55] = \frac{2 \cdot 2}{120} \times 0.07 = 0.00233 \]
This process is repeated for all 16 possible subsets of the 4 other features, and the results are summed to get the final SHAP value.
# SHAP Analysis for Exit Poll Interpretation
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
# Generate realistic exit poll data
np.random.seed(42)
n_constituencies = 500
# Simulate features based on real election data patterns
data = {
'avg_age': np.random.normal(45, 10, n_constituencies),
'avg_income': np.random.lognormal(10.5, 0.35, n_constituencies),
'education_index': np.random.beta(2, 3, n_constituencies) * 100,
'previous_vote_share': np.random.uniform(30, 70, n_constituencies),
'campaign_visits': np.random.poisson(3, n_constituencies),
'rural_urban_mix': np.random.uniform(0, 1, n_constituencies), # 0=rural, 1=urban
'incumbent_advantage': np.random.uniform(-10, 10, n_constituencies) # Negative for challenger advantage
}
df = pd.DataFrame(data)
# Simulate vote share based on realistic relationships
df['vote_share'] = (
0.35 * (df['previous_vote_share'] - 50) / 20 + # Normalized previous vote
0.25 * (df['avg_income'] - 50000) / 20000 + # Normalized income
0.15 * (df['education_index'] - 50) / 25 + # Normalized education
0.10 * df['campaign_visits'] / 5 + # Campaign visits effect
0.08 * (df['rural_urban_mix'] - 0.5) * 2 + # Urban/rural effect
0.07 * df['incumbent_advantage'] / 10 + # Incumbent advantage
np.random.normal(0, 3, n_constituencies) # Random noise
) * 10 + 50 # Scale to 0-100 range centered around 50
# Convert to classification problem (win/lose)
df['win'] = (df['vote_share'] > 50).astype(int)
# Prepare features and target
feature_names = ['avg_age', 'avg_income', 'education_index', 'previous_vote_share',
'campaign_visits', 'rural_urban_mix', 'incumbent_advantage']
X = df[feature_names]
y = df['win']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Create SHAP explainer
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Plot summary plot
shap.summary_plot(shap_values[1], X_test, feature_names=feature_names, show=False)
# Calculate mean absolute SHAP values for feature importance
mean_abs_shap = np.mean(np.abs(shap_values[1]), axis=0)
print("Mean absolute SHAP values (feature importance):")
for i, feature in enumerate(feature_names):
print(f"{feature}: {mean_abs_shap[i]:.4f}")
# Analyze a specific constituency
constituency_idx = 10 # A swing constituency
print(f"\nAnalysis for constituency {constituency_idx}:")
print(f"Actual vote share: {df.iloc[constituency_idx]['vote_share']:.1f}%")
print(f"Predicted probability of winning: {model.predict_proba([X_test.iloc[constituency_idx]])[0][1]:.3f}")
print("Feature contributions (SHAP values):")
for i, feature in enumerate(feature_names):
print(f"{feature}: {shap_values[1][constituency_idx][i]:.4f}")
In our exit poll analysis, SHAP values reveal:
LIME explains individual predictions by approximating the complex model locally with an interpretable one. For exit polls, this helps understand why specific constituencies voted the way they did.
The LIME explanation is obtained by solving the optimization problem:
\[ \xi(x) = \arg\min_{g \in G} \mathcal{L}(f, g, \pi_x) + \Omega(g) \]
Where \( \xi(x) \) is the explanation for instance x, f is the original model, g is an interpretable surrogate model from the class G, \( \pi_x \) is a proximity kernel that weights perturbed samples by their closeness to x, \( \mathcal{L} \) measures how poorly g approximates f in that neighbourhood, and \( \Omega(g) \) penalizes the complexity of g.
For a specific constituency with features:
LIME would:
\[ \mathcal{L}(f, g, \pi_x) = \sum_{z \in Z} \pi_x(z) (f(z) - g(z))^2 \]
The resulting explanation might be:
\[ g(x) = 0.45 + 0.32 \cdot \text{prev\_vote} + 0.28 \cdot \text{incumbent} + 0.19 \cdot \text{campaign} + 0.15 \cdot \text{urban} \]
# LIME for Constituency-Level Analysis
import lime
import lime.lime_tabular
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
# Create LIME explainer
explainer = lime.lime_tabular.LimeTabularExplainer(
X_train.values,
training_labels=y_train,
feature_names=feature_names,
class_names=['Loss', 'Win'],
mode='classification',
discretize_continuous=True,
random_state=42
)
# Select a constituency to explain - a close race
close_races = X_test[(model.predict_proba(X_test)[:, 1] > 0.4) &
(model.predict_proba(X_test)[:, 1] < 0.6)]
constituency_idx = close_races.index[0]
instance = X_test.loc[constituency_idx].values
# Explain the instance
exp = explainer.explain_instance(
instance,
model.predict_proba,
num_features=5,
top_labels=1
)
# Show explanation
print(f"LIME explanation for constituency {constituency_idx}:")
print(f"Actual result: {'Win' if y_test.loc[constituency_idx] == 1 else 'Loss'}")
print(f"Predicted probability: {model.predict_proba([instance])[0][1]:.3f}")
print("\nFeature contributions:")
for feature, weight in exp.as_list(label=1):
print(f"{feature}: {weight:.4f}")
# Compare with SHAP for the same constituency
shap_explanation = shap_values[1][X_test.index.get_loc(constituency_idx)]
print("\nSHAP values for comparison:")
for i, feature in enumerate(feature_names):
print(f"{feature}: {shap_explanation[i]:.4f}")
# Plot explanation
plt.figure(figsize=(10, 6))
exp.as_pyplot_figure()
plt.title(f"LIME Explanation for Constituency {constituency_idx}")
plt.tight_layout()
plt.show()
# Analyze a surprising result - model predicted win but actual loss
false_wins = X_test[(model.predict_proba(X_test)[:, 1] > 0.7) & (y_test == 0)]
if len(false_wins) > 0:
surprise_idx = false_wins.index[0]
surprise_instance = X_test.loc[surprise_idx].values
print(f"\nAnalyzing surprising result - constituency {surprise_idx}:")
print(f"Predicted win with probability {model.predict_proba([surprise_instance])[0][1]:.3f} but actually lost")
exp_surprise = explainer.explain_instance(
surprise_instance,
model.predict_proba,
num_features=5,
top_labels=1
)
print("LIME explanation:")
for feature, weight in exp_surprise.as_list(label=1):
print(f"{feature}: {weight:.4f}")
LIME helps campaign strategists understand:
Allocation weight \( w_i = \frac{|\phi_i|}{\sum_{j=1}^{n} |\phi_j|} \)
Message impact \( I = \sum_{i=1}^{n} \beta_i \cdot x_i \)
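A minimal sketch of turning feature attributions into the allocation weights defined above; the SHAP magnitudes and lever names used here are assumed numbers for illustration, not outputs of the earlier model.
# From feature attributions to allocation weights w_i = |phi_i| / sum_j |phi_j|
import numpy as np
# Assumed mean absolute SHAP values for four campaign levers (illustrative only)
phi = np.array([0.12, 0.08, 0.05, 0.03])
levers = ['previous_vote_share', 'campaign_visits', 'rural_urban_mix', 'incumbent_advantage']
weights = np.abs(phi) / np.abs(phi).sum()
for lever, w in zip(levers, weights):
    print(f"{lever}: allocation weight = {w:.2f}")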
We use various statistical methods to analyze exit poll data and make predictions.
The Z-score measures how many standard deviations an observation is from the mean:
\[ Z = \frac{X - \mu}{\sigma} \]
Where X is the observed value, \( \mu \) is the population mean, and \( \sigma \) is the population standard deviation.
For example, if a constituency has 55% votes for BJP, and the state average is 45% with a standard deviation of 5%:
\[ Z = \frac{55 - 45}{5} = 2 \]
This constituency is 2 standard deviations above the mean, indicating strong BJP support.
To compare two proportions (e.g., urban vs. rural support for a party):
\[ Z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{SE_{\hat{p}_1 - \hat{p}_2}} \]
Where the standard error of the difference is:
\[ SE_{\hat{p}_1 - \hat{p}_2} = \sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1} + \frac{1}{n_2})} \]
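A minimal worked example of this two-proportion test, with assumed urban and rural sample counts:
# Two-proportion z-test: urban vs rural support for a party (assumed counts)
import numpy as np
from scipy.stats import norm
x1, n1 = 540, 1000   # urban respondents supporting the party
x2, n2 = 480, 1000   # rural respondents supporting the party
p1_hat, p2_hat = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                       # pooled proportion under H0
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat) / se
p_value = 2 * (1 - norm.cdf(abs(z)))                 # two-sided test
print(f"Urban support: {p1_hat:.3f}, Rural support: {p2_hat:.3f}")
print(f"Z = {z:.2f}, two-sided p-value = {p_value:.4f}")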
We use matrix operations to process large exit poll datasets and calculate seat projections:
| Constituency | Sample Size | BJP Vote % | INC Vote % | Margin of Error | Projected Winner |
|---|---|---|---|---|---|
| Varanasi | 850 | 58.2 ± 3.1 | 32.5 ± 2.8 | ±3.4% | BJP |
| Amethi | 920 | 45.3 ± 3.5 | 47.8 ± 3.2 | ±3.2% | INC |
| Gandhinagar | 780 | 62.1 ± 3.8 | 28.5 ± 3.1 | ±3.5% | BJP |
| Hyderabad | 950 | 22.4 ± 2.9 | 18.7 ± 2.7 | ±3.2% | TRS |
In exit poll data, we distinguish between signal (the true underlying voting preferences) and noise (random variation introduced by sampling).
We use statistical methods to separate signal from noise:
\[ \text{Observed Difference} = \text{True Difference} + \text{Random Error} \]
Where random error represents noise due to sampling variability.
The chart shows how we distinguish true voting trends (signal) from random sampling variations (noise).
Relationship between sample size and margin of error in exit polling.
We test various hypotheses about voting patterns:
\[ Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1} + \frac{1}{n_2})}} \]
For comparing proportions between two groups, where \( \hat{p} = \frac{x_1 + x_2}{n_1 + n_2} \).
| Scenario | Null Hypothesis (H₀) | Alternative Hypothesis (H₁) |
|---|---|---|
| Party Lead | p₁ = p₂ | p₁ > p₂ |
| Gender Gap | p_male = p_female | p_male ≠ p_female |
| Regional Variation | p_north = p_south | p_north ≠ p_south |
We distinguish between statistical significance (the observed difference is unlikely to have arisen by chance) and practical significance (the difference is large enough to matter for the electoral outcome).
In election forecasting, even small percentage changes can be practically significant due to the winner-take-all nature of many electoral systems.
Example: A 1.5% lead may be statistically significant with a large sample but may not be practically significant in a first-past-the-post system if the lead is concentrated in safe seats.
Key considerations for exit polls:
We conduct power analysis to determine the sample size needed to detect effects in exit poll data:
\[ n = \frac{(z_{\alpha/2} + z_{\beta})^2 \cdot p(1-p)}{(\Delta)^2} \]
Where n is the required sample size, \( z_{\alpha/2} \) and \( z_{\beta} \) are standard normal critical values, p is the anticipated proportion (0.5 is the most conservative choice), and \( \Delta \) is the minimum detectable difference.
\( z_{\alpha/2} \) is the z-score that corresponds to your chosen significance level (α); for example, α = 0.05 gives \( z_{\alpha/2} = 1.96 \) and α = 0.01 gives 2.576. It represents the cutoff point beyond which we reject the null hypothesis.
\( z_{\beta} \) is the z-score that corresponds to the desired statistical power (1−β); for example, 80% power gives \( z_{\beta} \approx 0.84 \) and 90% power gives \( z_{\beta} \approx 1.28 \). It represents the ability to detect an effect when there truly is one.
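The calculation can be scripted directly from the formula above; this sketch uses p = 0.5 (the most conservative assumption) and scipy's inverse normal CDF.
# Sample size from a power analysis for detecting a difference in proportions
from scipy.stats import norm
import math
def required_sample_size(delta, alpha=0.05, power=0.80, p=0.5):
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided critical value for significance level alpha
    z_beta = norm.ppf(power)            # value corresponding to the desired power
    n = ((z_alpha + z_beta) ** 2 * p * (1 - p)) / (delta ** 2)
    return math.ceil(n)
for delta in [0.02, 0.03, 0.05]:
    print(f"Delta = {delta:.2f}: n = {required_sample_size(delta)}")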
The significance level (α), confidence level, and z-scores are mathematically interconnected:
| Confidence Level | Significance Level (α) | Alpha Division (α/2) | Z-Score (zα/2) |
|---|---|---|---|
| 90% | 0.10 | 0.05 | 1.645 |
| 95% | 0.05 | 0.025 | 1.960 |
| 99% | 0.01 | 0.005 | 2.576 |
Key Relationships: the significance level is α = 1 − confidence level, it is split equally between the two tails (α/2 per tail), and the critical value is \( z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2) \).
The scenarios below show how these parameters affect the required sample size:
| Scenario | Effect Size (Δ) | α | Power (1-β) | Sample Size (n) |
|---|---|---|---|---|
| National vote share | 0.03 | 0.05 | 0.80 | 1,068 |
| State-level prediction | 0.05 | 0.05 | 0.80 | 384 |
| Gender gap detection | 0.07 | 0.05 | 0.90 | 558 |
| Close constituency | 0.02 | 0.05 | 0.95 | 4,802 |
In election forecasting, power analysis helps us:
The minimum detectable effect size (Δ) represents the smallest difference that is both statistically significant and politically meaningful in election forecasting.
In electoral contexts, Δ is the smallest percentage point difference that could change political outcomes:
Political analysts consider several factors when setting Δ:
The standard normal distribution is a fundamental concept in statistics that plays a crucial role in calculating Z Alpha/2 values for exit poll analysis.
The standard normal distribution is a normal distribution with mean 0 and standard deviation 1.
The probability density function (PDF) of the standard normal distribution is:
\[ \phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2} \]
Where z is the number of standard deviations from the mean.
The cumulative distribution function Φ(z) gives the probability that a standard normal random variable is less than or equal to z:
\[ \Phi(z) = P(Z \leq z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-t^2/2} \, dt \]
Where Z is a standard normal random variable.
The inverse CDF, denoted as Φ⁻¹(p), returns the value z such that Φ(z) = p.
For a given probability p, Φ⁻¹(p) finds the z-value where the area under the standard normal curve to the left of z equals p.
This is computed as follows:
1. Start with probability p (e.g., 0.975 for 95% confidence)
2. Use an approximation formula or statistical software to find z
3. For p = 0.975, z ≈ 1.96
Several numerical approximations exist for calculating Φ⁻¹(p).
One common approximation (for p ≥ 0.5) is:
\[ z \approx t - \frac{c_0 + c_1 t + c_2 t^2}{1 + d_1 t + d_2 t^2 + d_3 t^3} \]
Where t = √(-2·ln(1-p)) and c₀, c₁, c₂, d₁, d₂, d₃ are constants.
To find \( Z_{\alpha/2} \) for a given confidence level:
1. Determine α (e.g., α = 0.05 for 95% confidence)
2. Calculate α/2 (e.g., 0.05/2 = 0.025)
3. Find 1 − α/2 (e.g., 1 − 0.025 = 0.975)
4. Compute Φ⁻¹(1 − α/2) (e.g., Φ⁻¹(0.975) ≈ 1.96)
Example for 95% confidence: α = 0.05, α/2 = 0.025, 1 − α/2 = 0.975, \( Z_{\alpha/2} = \Phi^{-1}(0.975) \approx 1.96 \)
Example for 99% confidence: α = 0.01, α/2 = 0.005, 1 − α/2 = 0.995, \( Z_{\alpha/2} = \Phi^{-1}(0.995) \approx 2.576 \)
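In practice the inverse CDF is obtained from statistical software rather than an approximation formula; a minimal check with scipy reproduces the critical values above.
# Critical values from the inverse standard normal CDF
from scipy.stats import norm
for confidence in [0.90, 0.95, 0.99]:
    alpha = 1 - confidence
    z_crit = norm.ppf(1 - alpha / 2)     # Phi^{-1}(1 - alpha/2)
    print(f"{confidence:.0%} confidence: alpha = {alpha:.2f}, Z_alpha/2 = {z_crit:.3f}")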
Understanding how effect size and critical z-values are calculated is essential for proper exit poll design and interpretation.
The effect size (Δ) in exit polls typically represents the minimum detectable difference in proportions:
\[ \Delta = p_1 - p_0 \]
Where \( p_0 \) is the baseline proportion (e.g., 0.5 for a tied race) and \( p_1 \) is the proportion after the politically meaningful change.
1. Determine the politically meaningful difference
2. Set p₀ (e.g., 0.5 for a tied race)
3. Calculate p₁ = p₀ + Δ
4. Use these values in sample size calculations
\( Z_{\alpha/2} \) represents the critical value from the standard normal distribution for a given significance level (α):
\[ Z_{\alpha/2} = \Phi^{-1}\left(1 - \frac{\alpha}{2}\right) \]
Where \( \Phi^{-1} \) is the inverse standard normal CDF and α is the probability of a Type I error.
| Confidence Level | α (Significance) | α/2 | Zα/2 |
|---|---|---|---|
| 90% | 0.10 | 0.05 | 1.645 |
| 95% | 0.05 | 0.025 | 1.960 |
| 99% | 0.01 | 0.005 | 2.576 |
The relationship between effect size, z-values, and sample size is given by:
\[ n = \frac{(z_{\alpha/2} + z_{\beta})^2 \cdot p(1-p)}{\Delta^2} \]
Where n is the required sample size, \( z_{\alpha/2} \) and \( z_{\beta} \) are the critical values for the significance level and power, p is the anticipated proportion, and Δ is the effect size.
| Effect Size (Δ) | Statistical Meaning | Political Significance in Indian Elections | Example Impact |
|---|---|---|---|
| 0.01-0.02 (1-2%) | Very small effect | Could determine outcomes in razor-thin margin constituencies | 10-20 seats in closely contested states |
| 0.03-0.05 (3-5%) | Small to moderate effect | Significant enough to change results in swing states | 30-50 seats, potentially determining majority |
| 0.06-0.08 (6-8%) | Moderate to large effect | Substantial swing indicating major political shift | 60-80 seats, clear majority territory |
| > 0.08 (8%+) | Large effect | Landslide victory or major political realignment | 100+ seats, overwhelming majority |
See how different effect sizes translate to political outcomes:
With Δ = 0.05 (5% swing):
To detect Δ = 0.05 with 80% power:
Choosing an appropriate Δ is crucial for designing effective exit polls:
For most national exit polls, Δ between 0.03-0.05 represents a practical balance between statistical precision and political relevance.
We use various visualization methods to represent different types of data and relationships in exit poll analysis.
Purpose: Compare values across categories
Use Case: Party vote share by state
Data Type: Categorical vs. Numerical
Purpose: Show changes over time
Use Case: Voting patterns across elections
Data Type: Temporal vs. Numerical
Purpose: Show frequency distribution
Use Case: Age distribution of voters
Data Type: Numerical (continuous)
Purpose: Show parts of a whole
Use Case: Party vote share percentage
Data Type: Categorical proportions
Purpose: Show correlation between variables
Use Case: Income vs. voting preference
Data Type: Numerical vs. Numerical
Purpose: Show spatial patterns and autocorrelation
Use Case: Regional voting patterns with spatial clustering
Data Type: Geographic coordinates with attribute values
Histogram: Distribution of voter age groups
Bar Chart: Party preferences by state
Pie Chart: Overall vote share distribution
Heat Map: Regional voting patterns with spatial autocorrelation
Line Chart: Trends in voter preferences over time
We create interactive visualizations to allow users to explore data:
We use GIS to create maps that show spatial patterns in voting behavior and analyze spatial autocorrelation:
Spatial autocorrelation measures how similar objects are to nearby objects. In electoral analysis, it helps identify:
We calculate spatial autocorrelation using Moran's I, which measures global spatial autocorrelation:
\[ I = \frac{n}{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}} \cdot \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (x_i - \bar{x}) (x_j - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]
Where n is the number of regions, \( w_{ij} \) is the spatial weight between regions i and j, \( x_i \) is the value of the variable (e.g., vote share) in region i, and \( \bar{x} \) is its mean across regions.
For each region, we calculate weights representing its spatial relationship with all other regions:
| Region Pair | Weight (wᵢⱼ) | Interpretation |
|---|---|---|
| North-North | 0 | No self-relationship |
| North-South | 1 | Strong connection (adjacent) |
| North-East | 0.5 | Moderate connection |
| North-West | 0.5 | Moderate connection |
| North-Central | 1 | Strong connection (adjacent) |
For each pair of regions (i, j), the spatial weight \( w_{ij} \) is multiplied by the deviations of both regions from the mean, \( w_{ij}(x_i - \bar{x})(x_j - \bar{x}) \), and these weighted products are summed to form the numerator of Moran's I.
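A minimal sketch of the Moran's I computation for five regions; the vote shares and the symmetric weight matrix below are assumed illustrative values in the spirit of the weights listed above.
# Moran's I for regional vote shares (illustrative weights and values)
import numpy as np
regions = ['North', 'South', 'East', 'West', 'Central']
x = np.array([52.0, 43.0, 47.0, 55.0, 50.0])       # assumed vote shares (%)
# Symmetric spatial weight matrix (0 = no self-relationship, higher = stronger adjacency)
W = np.array([
    [0.0, 1.0, 0.5, 0.5, 1.0],
    [1.0, 0.0, 0.5, 0.5, 1.0],
    [0.5, 0.5, 0.0, 0.5, 1.0],
    [0.5, 0.5, 0.5, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 0.0],
])
n = len(x)
dev = x - x.mean()
numerator = (W * np.outer(dev, dev)).sum()          # sum_i sum_j w_ij (x_i - x_bar)(x_j - x_bar)
denominator = (dev ** 2).sum()
morans_i = (n / W.sum()) * (numerator / denominator)
print(f"Moran's I = {morans_i:.3f}")
print(f"Expected value under spatial randomness: {-1 / (n - 1):.3f}")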
Interpretation of Moran's I: values well above the expectation under spatial randomness (−1/(n−1), close to 0) indicate that similar vote shares cluster geographically, values near the expectation indicate no spatial pattern, and clearly negative values indicate that dissimilar values tend to be adjacent.
We also use Local Indicators of Spatial Association (LISA) to identify local hot spots (high values surrounded by high values), cold spots (low values surrounded by low values), and spatial outliers (high surrounded by low, or low surrounded by high).